Email :  |  Classe : M1 Bioinformatique - parcours OSB et MISO
R : 4.4.x  |  ️ RStudio : 2024.09+  |  Bioconductor : 3.20



1 Introduction

1.1 Contexte Biologique

1.1.1 Le Cancer Colorectal (CRC)

Le cancer colorectal est l’un des cancers les plus fréquents dans le monde :

Statistiques mondiales (2020) : - 3ème cancer le plus diagnostiqué (1.9 million de cas/an) - 2ème cause de décès par cancer (~900,000 décès/an) - Incidence croissante chez les jeunes (<50 ans) - Survie à 5 ans : 65% (tous stades), 90% (stade I), 14% (stade IV)

1.1.1.1 Séquence Adénome-Carcinome

Le CRC se développe sur 10-15 ans selon ce modèle:

Épithélium normal
    ↓ Mutation APC
Polype bénin (adénome)
    ↓ Mutation KRAS
Adénome dysplasique
    ↓ Mutation TP53, SMAD4
Carcinome in situ
    ↓ Invasion stroma
Carcinome invasif
    ↓ Dissémination
Métastases (foie, poumon)

1.1.1.2 Mutations Clés et Voies Altérées

Gène/Voie Type Fréquence Fonction Impact
APC Suppresseur 80% Wnt/β-catenin Initiation
KRAS Oncogène 40% MAPK signaling Progression
TP53 Suppresseur 50% Apoptose, cycle Progression
PIK3CA Oncogène 15% PI3K/Akt/mTOR Prolifération
SMAD4 Suppresseur 10% TGF-β Invasion
BRAF Oncogène 10% MAPK (V600E) Mauvais pronostic

1.1.2 Intérêt de l’Analyse Transcriptomique

L’analyse RNA-seq du CRC permet de :

Identifier les gènes dérégulés (DEG)
Classifier les tumeurs en sous-types moléculaires (CMS1-4)
Découvrir de nouveaux biomarqueurs diagnostiques
Prédire la réponse thérapeutique
Proposer des cibles thérapeutiques


1.2 ️ Le Projet TCGA-COAD

1.2.1 The Cancer Genome Atlas (TCGA)

TCGA (2006-2018) : Plus grand effort de caractérisation moléculaire du cancer

  • 33 types de cancer analysés
  • >11,000 patients séquencés
  • Multi-omics : WGS, RNA-seq, Méthylation, CNV, Protéomique
  • Open access : Données publiques sur GDC Portal

1.2.2 Dataset Utilisé dans ce TD

Projet : TCGA Colon Adenocarcinoma (TCGA-COAD)

Technologie :
- RNA-seq Illumina HiSeq 2000/2500
- Paired-end 2×75bp ou 2×100bp
- Alignement : STAR v2.7
- Quantification : HTSeq-Counts

Échantillons :
- ~480 échantillons au total dans TCGA - Pour ce TD : Sous-ensemble filtré et pré-traité - Tumeurs primaires (Primary Solid Tumor) - Tissus adjacents normaux (Solid Tissue Normal)

Gènes :
- ~60,000 features (ensemble Ensembl) - ~20,000 gènes codants (protein-coding)

1.2.2.1 Publication de Référence

The Cancer Genome Atlas Network. (2012).
Comprehensive molecular characterization of human colon and rectal cancer.
Nature, 487(7407), 330-337.
DOI: 10.1038/nature11252 Xena dataset : [https://xenabrowser.net/datapages/?dataset=TCGA.COAD.sampleMap/HiSeqV2&host=https%3A%2F%2Ftcga.xenahubs.net]

Principaux Résultats TCGA-COAD : - 16% des tumeurs : hypermutées (MSI-H, POLE mutations) - 84% des tumeurs : non-hypermutées (chromosomal instability) - Voie WNT dérégulée dans 93% des cas - Identification de 4 sous-types CMS (Consensus Molecular Subtypes)


1.3 Objectifs Pédagogiques du TD

À la fin de ce TD, vous serez capable de :

** COMPÉTENCES TECHNIQUES**

  1. Charger et inspecter des données RNA-seq réelles (TCGA-COAD)
  2. Réaliser un contrôle qualité complet :
    • Distributions des comptages (boxplot, density, violin)
    • PCA (PC1/PC2, scree plot, version interactive)
    • Clustering hiérarchique et heatmap des distances
  3. Effectuer une analyse différentielle avec DESeq2 :
    • Normalisation, estimation de dispersion, test de Wald
    • Extraction et filtrage des gènes significatifs (padj, log2FC)
  4. Créer des visualisations professionnelles :
    • MA plots
    • Volcano plots
    • Heatmaps hiérarchiques
    • Count plots de gènes individuels
  5. Interpréter les statistiques : p-value, FDR (padj), log2FC, baseMean
  6. Effectuer des enrichissements fonctionnels :
    • Gene Ontology BP (ORA globale + séparée Up/Down)
    • KEGG Pathways (ORA)
    • GSEA sur KEGG (running score)
    • Reactome Pathways
  7. Comparer ORA vs GSEA (principes, avantages, NES)

** COMPÉTENCES BIOLOGIQUES**

  1. Interpréter les résultats dans le contexte du cancer colorectal (CRC)
  2. Identifier les voies biologiques dérégulées (Wnt, MAPK, PI3K, cycle cellulaire)
  3. Proposer des gènes candidats pour validation expérimentale (qRT-PCR, Western Blot)

2 Installation et Chargement des Packages

2.1 Installation des Packages (si nécessaire)

** CONSEIL** : Exécutez cette section UNE SEULE FOIS sur votre ordinateur.
Si les packages sont déjà installés, passez directement au chargement.

# Installer BiocManager (gestionnaire de packages Bioconductor)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# BIOCONDUCTOR
BiocManager::install(c(
    "DESeq2",              # Analyse différentielle
    "org.Hs.eg.db",        # Annotations Homo sapiens
    "AnnotationDbi",       # Interface annotations
    "clusterProfiler",     # Enrichissement GO, KEGG
    "enrichplot",          # Visualisations enrichissement
    "ReactomePA",          # Reactome pathways
    "EnhancedVolcano"      # Volcano plots
))

# CRAN
install.packages(c(
    "tidyverse",           # ggplot2, dplyr, etc.
    "pheatmap",            # Heatmaps
    "patchwork",           # Combiner plots
    "plotly",              # Plots interactifs
    "viridis",             # Palettes de couleurs
    "dendextend"           # Dendrogrammes colorés
))

2.2 Chargement des Packages

# CHARGEMENT DES LIBRARIES

# Manipulation de données
library(tidyverse)         # ggplot2, dplyr, tidyr, etc.

# Analyse RNA-seq
library(DESeq2)            # Analyse différentielle
library(EnhancedVolcano)
# Annotations
library(org.Hs.eg.db)      # Annotations humaines
library(AnnotationDbi)     # Interface

# Enrichissement
library(clusterProfiler)   # GO, KEGG enrichment
library(enrichplot)        # Visualisations
library(ReactomePA)        # Reactome pathways

# Visualisations
library(pheatmap)          # Heatmaps
library(patchwork)
# Couleurs
library(RColorBrewer)      # Palettes
library(viridis)           # Palettes accessibles

# PCA
library(FactoMineR)        # PCA avancée
library(factoextra)        # Visualisation PCA

# Paramètres ggplot par défaut
theme_set(theme_minimal(base_size = 14))

# Désactiver notation scientifique
options(scipen = 999)

cat("les packages chargés avec succès!\n")
#> les packages chargés avec succès!

2.3 Information de Session

# Afficher les versions (important pour reproductibilité)
sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=French_France.utf8  LC_CTYPE=French_France.utf8   
#> [3] LC_MONETARY=French_France.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=French_France.utf8    
#> 
#> time zone: Europe/Paris
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] factoextra_1.0.7            FactoMineR_2.11            
#>  [3] viridis_0.6.5               viridisLite_0.4.3          
#>  [5] RColorBrewer_1.1-3          patchwork_1.3.2            
#>  [7] pheatmap_1.0.13             ReactomePA_1.46.0          
#>  [9] enrichplot_1.22.0           clusterProfiler_4.10.1     
#> [11] org.Hs.eg.db_3.18.0         AnnotationDbi_1.64.1       
#> [13] EnhancedVolcano_1.20.0      ggrepel_0.9.6              
#> [15] DESeq2_1.42.1               SummarizedExperiment_1.32.0
#> [17] Biobase_2.62.0              MatrixGenerics_1.14.0      
#> [19] matrixStats_1.5.0           GenomicRanges_1.54.1       
#> [21] GenomeInfoDb_1.38.8         IRanges_2.36.0             
#> [23] S4Vectors_0.40.2            BiocGenerics_0.48.1        
#> [25] lubridate_1.9.4             forcats_1.0.1              
#> [27] stringr_1.6.0               dplyr_1.1.4                
#> [29] purrr_1.0.4                 readr_2.1.5                
#> [31] tidyr_1.3.1                 tibble_3.2.1               
#> [33] ggplot2_4.0.2               tidyverse_2.0.0            
#> 
#> loaded via a namespace (and not attached):
#>   [1] rstudioapi_0.18.0       jsonlite_2.0.0          magrittr_2.0.3         
#>   [4] estimability_1.5.1      farver_2.1.2            rmarkdown_2.30         
#>   [7] fs_1.6.5                zlibbioc_1.48.2         vctrs_0.6.5            
#>  [10] memoise_2.0.1           RCurl_1.98-1.17         ggtree_3.10.1          
#>  [13] htmltools_0.5.8.1       S4Arrays_1.2.1          SparseArray_1.2.4      
#>  [16] gridGraphics_0.5-1      sass_0.4.9              bslib_0.10.0           
#>  [19] htmlwidgets_1.6.4       plyr_1.8.9              emmeans_2.0.1          
#>  [22] cachem_1.1.0            uuid_1.2-1              igraph_2.1.4           
#>  [25] lifecycle_1.0.5         pkgconfig_2.0.3         gson_0.1.0             
#>  [28] Matrix_1.5-4.1          R6_2.6.1                fastmap_1.2.0          
#>  [31] GenomeInfoDbData_1.2.11 digest_0.6.37           aplot_0.2.9            
#>  [34] colorspace_2.1-1        RSQLite_2.3.9           timechange_0.3.0       
#>  [37] httr_1.4.8              polyclip_1.10-7         abind_1.4-8            
#>  [40] compiler_4.3.1          bit64_4.6.0-1           withr_3.0.2            
#>  [43] graphite_1.48.0         S7_0.2.0                BiocParallel_1.36.0    
#>  [46] DBI_1.2.3               ggforce_0.4.2           MASS_7.3-60            
#>  [49] rappdirs_0.3.3          DelayedArray_0.28.0     scatterplot3d_0.3-44   
#>  [52] HDO.db_0.99.1           flashClust_1.01-2       tools_4.3.1            
#>  [55] otel_0.2.0              scatterpie_0.2.6        ape_5.8-1              
#>  [58] glue_1.8.0              nlme_3.1-168            GOSemSim_2.28.1        
#>  [61] shadowtext_0.1.6        grid_4.3.1              cluster_2.1.8.1        
#>  [64] reshape2_1.4.4          fgsea_1.28.0            generics_0.1.4         
#>  [67] gtable_0.3.6            tzdb_0.5.0              data.table_1.17.0      
#>  [70] hms_1.1.4               tidygraph_1.3.1         XVector_0.42.0         
#>  [73] pillar_1.11.1           yulab.utils_0.2.4       splines_4.3.1          
#>  [76] tweenr_2.0.3            treeio_1.26.0           lattice_0.22-7         
#>  [79] bit_4.6.0               tidyselect_1.2.1        GO.db_3.18.0           
#>  [82] locfit_1.5-9.12         Biostrings_2.70.3       reactome.db_1.86.2     
#>  [85] knitr_1.51              gridExtra_2.3           xfun_0.52              
#>  [88] graphlayouts_1.2.2      DT_0.34.0               stringi_1.8.7          
#>  [91] lazyeval_0.2.2          ggfun_0.2.0             yaml_2.3.7             
#>  [94] evaluate_1.0.5          codetools_0.2-20        ggraph_2.2.1           
#>  [97] qvalue_2.34.0           graph_1.80.0            multcompView_0.1-11    
#> [100] ggplotify_0.1.3         cli_3.6.4               xtable_1.8-4           
#> [103] systemfonts_1.2.2       jquerylib_0.1.4         Rcpp_1.0.14            
#> [106] png_0.1-8               parallel_4.3.1          leaps_3.2              
#> [109] blob_1.3.0              DOSE_3.28.2             bitops_1.0-9           
#> [112] mvtnorm_1.3-3           tidytree_0.4.7          ggiraph_0.8.13         
#> [115] scales_1.4.0            crayon_1.5.3            rlang_1.1.5            
#> [118] cowplot_1.2.0           fastmatch_1.1-6         KEGGREST_1.42.0

3 Préparation de l’Espace de Travail

3.1 Création des Dossiers

# Créer structure de dossiers
dir.create("results", showWarnings = FALSE)
dir.create("results/tables", showWarnings = FALSE, recursive = TRUE)
dir.create("results/enrichment", showWarnings = FALSE, recursive = TRUE)

dir.create("figures", showWarnings = FALSE)
dir.create("figures/QC", showWarnings = FALSE, recursive = TRUE)
dir.create("figures/DEG", showWarnings = FALSE, recursive = TRUE)
dir.create("figures/enrichment", showWarnings = FALSE, recursive = TRUE)

cat("Structure de dossiers créée:\n")
#> Structure de dossiers créée:
cat("   results/\n")
#>    results/
cat("   ├─ tables/       (tableaux CSV/Excel)\n")
#>    ├─ tables/       (tableaux CSV/Excel)
cat("   └─ enrichment/   (résultats GO/KEGG)\n")
#>    └─ enrichment/   (résultats GO/KEGG)
cat("   figures/\n")
#>    figures/
cat("   ├─ QC/           (PCA, clustering, distributions)\n")
#>    ├─ QC/           (PCA, clustering, distributions)
cat("   ├─ DEG/          (Volcano, MA, heatmaps)\n")
#>    ├─ DEG/          (Volcano, MA, heatmaps)
cat("   └─ enrichment/   (dotplots, barplots, GSEA)\n")
#>    └─ enrichment/   (dotplots, barplots, GSEA)

4 Chargement et Exploration des Données

4.1 Chargement des Données

** FICHIERS REQUIS**

Assurez-vous d’avoir ces deux fichiers dans votre répertoire de travail : - COAD_counts_matrix.rds : Matrice de comptages (gènes × échantillons) - COAD_metadata.rds : Métadonnées des échantillons

# Charger les données
cat(" Chargement des données TCGA-COAD...\n")
#>  Chargement des données TCGA-COAD...
counts_matrix <- readRDS("COAD_counts_matrix.rds")
metadata <- readRDS("COAD_metadata.rds")

cat("Données chargées avec succès!\n\n")
#> Données chargées avec succès!
# Vérification rapide
cat("=== DIMENSIONS ===\n")
#> === DIMENSIONS ===
cat("Matrice de comptages:", nrow(counts_matrix), "gènes ×", ncol(counts_matrix), "échantillons\n")
#> Matrice de comptages: 16892 gènes × 327 échantillons
cat("Métadonnées:", nrow(metadata), "échantillons ×", ncol(metadata), "variables\n")
#> Métadonnées: 327 échantillons × 4 variables

4.2 Inspection des Données

4.2.1 Structure de la Matrice de Comptages

# Premières lignes/colonnes
cat("=== APERÇU MATRICE DE COMPTAGES ===\n")
#> === APERÇU MATRICE DE COMPTAGES ===
counts_matrix[1:10, 1:5]
# Type de données
cat("\nType de données:", class(counts_matrix), "\n")
#> 
#> Type de données: data.frame
cat("Type de valeurs:", class(counts_matrix[1,1]), "\n")
#> Type de valeurs: numeric
# Convertir en matrice numérique
counts_mat <- as.matrix(counts_matrix)

# Statistiques globales
cat("Comptage minimum:", min(counts_mat), "\n")
#> Comptage minimum: 0
cat("Comptage maximum:", format(max(counts_mat), big.mark=","), "\n")
#> Comptage maximum: 1,336,102
cat("Comptage moyen:", round(mean(counts_mat), 1), "\n")
#> Comptage moyen: 1131.2
cat("Comptage médian:", median(counts_mat), "\n")
#> Comptage médian: 340
# Proportion de zéros
prop_zeros <- sum(counts_mat == 0) / (nrow(counts_mat) * ncol(counts_mat))

QUESTION 1a : Combien d’échantillons tumoraux et normaux avons-nous?
(Répondre dans le document Word séparé)

4.2.2 Structure des Métadonnées

# Aperçu des métadonnées
cat("=== APERÇU MÉTADONNÉES ===\n")
#> === APERÇU MÉTADONNÉES ===
head(metadata, 10)
# Structure
cat("\n=== STRUCTURE ===\n")
#> 
#> === STRUCTURE ===
str(metadata)
#> 'data.frame':    327 obs. of  4 variables:
#>  $ sample_id : chr  "TCGA-CA-5256-01" "TCGA-AZ-6599-01" "TCGA-AA-3655-01" "TCGA-A6-6137-01" ...
#>  $ barcode   : chr  "TCGA-CA-5256-01" "TCGA-AZ-6599-01" "TCGA-AA-3655-01" "TCGA-A6-6137-01" ...
#>  $ condition : chr  "Tumor" "Tumor" "Tumor" "Tumor" ...
#>  $ patient_id: chr  "TCGA-CA-5256" "TCGA-AZ-6599" "TCGA-AA-3655" "TCGA-A6-6137" ...
# Résumé statistique
cat("\n=== RÉSUMÉ ===\n")
#> 
#> === RÉSUMÉ ===
summary(metadata)
#>   sample_id           barcode           condition          patient_id       
#>  Length:327         Length:327         Length:327         Length:327        
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character

4.2.3 Distribution des Groupes

# Compter les échantillons par groupe
table_groups <- table(metadata$condition)

cat("=== DISTRIBUTION DES GROUPES ===\n")
#> === DISTRIBUTION DES GROUPES ===
print(table_groups)
#> 
#> Normal  Tumor 
#>     41    286
# Pourcentages
prop_groups <- prop.table(table_groups) * 100
cat("\nPourcentages:\n")
#> 
#> Pourcentages:
print(round(prop_groups, 1))
#> 
#> Normal  Tumor 
#>   12.5   87.5
# Visualisation
barplot(table_groups, 
        col = c("steelblue", "coral"),
        main = "Distribution des Échantillons",
        ylab = "Nombre d'échantillons",
        las = 1)

QUESTION 1b : Pourquoi y a-t-il beaucoup plus de tumeurs que de normaux dans TCGA?


4.3 Vérification de la Correspondance

# Vérifier que les échantillons correspondent
samples_counts <- colnames(counts_matrix)
samples_metadata <- rownames(metadata)

cat("=== CORRESPONDANCE ÉCHANTILLONS ===\n")
#> === CORRESPONDANCE ÉCHANTILLONS ===
cat("Échantillons dans counts:", length(samples_counts), "\n")
#> Échantillons dans counts: 327
cat("Échantillons dans metadata:", length(samples_metadata), "\n")
#> Échantillons dans metadata: 327
# Vérification stricte
all_match <- all(samples_counts %in% samples_metadata) && 
             all(samples_metadata %in% samples_counts)

if(all_match) {
  cat("Tous les échantillons correspondent parfaitement!\n")
} else {
  cat(" Attention: Certains échantillons ne correspondent pas!\n")
  
  # Identifier les différences
  missing_in_metadata <- setdiff(samples_counts, samples_metadata)
  missing_in_counts <- setdiff(samples_metadata, samples_counts)
  
  if(length(missing_in_metadata) > 0) {
    cat("Manquants dans metadata:", length(missing_in_metadata), "\n")
  }
  if(length(missing_in_counts) > 0) {
    cat("Manquants dans counts:", length(missing_in_counts), "\n")
  }
}
#> Tous les échantillons correspondent parfaitement!
# Réordonner metadata pour correspondre exactement à counts
metadata <- metadata[colnames(counts_matrix), , drop = FALSE]
cat("\nMétadonnées réordonnées pour correspondre à la matrice de comptages\n")
#> 
#> Métadonnées réordonnées pour correspondre à la matrice de comptages

5 Contrôle Qualité et Exploration

5.1 Distribution des Comptages Totaux par Échantillon

5.1.1 Library Sizes (Profondeur de séquençage)

# Calculer les comptages totaux par échantillon
library_sizes <- colSums(counts_matrix)

# Créer un dataframe pour ggplot
df_lib_sizes <- data.frame(
  Sample = names(library_sizes),
  Library_Size = library_sizes,
  Sample_Type = metadata$condition
)

# Statistiques
cat("=== LIBRARY SIZES ===\n")
#> === LIBRARY SIZES ===
cat("Minimum:", format(min(library_sizes), big.mark=","), "reads\n")
#> Minimum: 16,313,130 reads
cat("Maximum:", format(max(library_sizes), big.mark=","), "reads\n")
#> Maximum: 26,263,500 reads
cat("Moyenne:", format(round(mean(library_sizes)), big.mark=","), "reads\n")
#> Moyenne: 19,108,199 reads
cat("Médiane:", format(median(library_sizes), big.mark=","), "reads\n")
#> Médiane: 18,871,360 reads
# Par groupe
cat("\n=== PAR GROUPE ===\n")
#> 
#> === PAR GROUPE ===
aggregate(Library_Size ~ Sample_Type, data = df_lib_sizes, FUN = function(x) {
  c(Mean = mean(x), Median = median(x), SD = sd(x))
})
### Visualisation: Boxplot
# Boxplot
p1 <- ggplot(df_lib_sizes, aes(x = Sample_Type, y = Library_Size, fill = Sample_Type)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 2, alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  scale_fill_manual(values = c("Normal" = "steelblue", "Tumor" = "coral")) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Library Sizes par Groupe",
    subtitle = "Distribution des comptages totaux par échantillon",
    x = "Type d'échantillon",
    y = "Comptages totaux (reads)",
    fill = "Type"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

print(p1)

# Sauvegarder
ggsave("figures/QC/library_sizes_boxplot.pdf", p1, width = 10, height = 6)
cat("Figure sauvegardée: figures/QC/library_sizes_boxplot.pdf\n")
#> Figure sauvegardée: figures/QC/library_sizes_boxplot.pdf

QUESTION 2 : Observez-vous une différence dans les comptages totaux entre tumeur et normal?
Qu’est-ce que cela pourrait indiquer?

5.1.2 Visualisation: Distribution Density

# Density plot
p2 <- ggplot(df_lib_sizes, aes(x = Library_Size, fill = Sample_Type)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("Normal" = "steelblue", "Tumor" = "coral")) +
  scale_x_continuous(labels = scales::comma) +
  labs(
    title = "Distribution des Library Sizes",
    x = "Comptages totaux (reads)",
    y = "Densité",
    fill = "Type"
  ) +
  theme_minimal(base_size = 14)

print(p2)

ggsave("figures/QC/library_sizes_density.pdf", p2, width = 10, height = 6)

5.2 Distribution des Comptages par Gène

5.2.1 Gènes Détectés par Échantillon

# Compter gènes avec au moins 1 read par échantillon
genes_detected <- colSums(counts_matrix > 0)

df_genes <- data.frame(
  Sample = names(genes_detected),
  Genes_Detected = genes_detected,
  Sample_Type = metadata$condition
)

# Statistiques
cat("=== GÈNES DÉTECTÉS ===\n")
#> === GÈNES DÉTECTÉS ===
cat("Minimum:", min(genes_detected), "gènes\n")
#> Minimum: 15495 gènes
cat("Maximum:", max(genes_detected), "gènes\n")
#> Maximum: 16811 gènes
cat("Moyenne:", round(mean(genes_detected)), "gènes\n")
#> Moyenne: 16318 gènes
# Par groupe
cat("\n=== PAR GROUPE ===\n")
#> 
#> === PAR GROUPE ===
aggregate(Genes_Detected ~ Sample_Type, data = df_genes, FUN = mean)

5.2.2 Visualisation

p3 <- ggplot(df_genes, aes(x = Sample_Type, y = Genes_Detected, fill = Sample_Type)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, fill = "white", outlier.shape = NA) +
  scale_fill_manual(values = c("Normal" = "steelblue", "Tumor" = "coral")) +
  labs(
    title = "Gènes Détectés par Échantillon",
    x = "Type d'échantillon",
    y = "Nombre de gènes détectés (counts > 0)",
    fill = "Type"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

print(p3)

ggsave("figures/QC/genes_detected.pdf", p3, width = 10, height = 6)

5.3 Filtrage des Gènes Faiblement Exprimés

️ POURQUOI FILTRER?

Les gènes avec très peu de reads : - N’ont pas assez de puissance statistique - Augmentent le nombre de tests multiples (pénalité FDR) - Peuvent être du bruit technique

Critère standard : Garder les gènes avec ≥10 reads dans ≥X échantillons
(où X = plus petit groupe, ici les normaux)

# Nombre minimum d'échantillons (taille du plus petit groupe)
min_samples <- min(table(metadata$condition))

cat("=== FILTRAGE DES GÈNES ===\n")
#> === FILTRAGE DES GÈNES ===
cat("Critère: ≥10 reads dans au moins", min_samples, "échantillons\n\n")
#> Critère: ≥10 reads dans au moins 41 échantillons
cat("Avant filtrage:", nrow(counts_matrix), "gènes\n")
#> Avant filtrage: 16892 gènes
# Appliquer le filtre
keep <- rowSums(counts_matrix >= 10) >= min_samples
counts_filtered <- counts_matrix[keep, ]

cat("Après filtrage:", nrow(counts_filtered), "gènes\n")
#> Après filtrage: 16023 gènes
cat("Gènes supprimés:", nrow(counts_matrix) - nrow(counts_filtered), 
    "(", round((1 - nrow(counts_filtered)/nrow(counts_matrix))*100, 1), "%)\n")
#> Gènes supprimés: 869 ( 5.1 %)
# Visualiser l'effet du filtrage
df_filter <- data.frame(
  Category = c("Avant filtrage", "Après filtrage"),
  N_Genes = c(nrow(counts_matrix), nrow(counts_filtered))
)

barplot(df_filter$N_Genes, 
        names.arg = df_filter$Category,
        col = c("lightgray", "steelblue"),
        main = "Effet du Filtrage",
        ylab = "Nombre de gènes",
        las = 1)


6 Création de l’Objet DESeq2

6.1 Construction de l’Objet DESeqDataSet

cat("=== CRÉATION DE L'OBJET DESeq2 ===\n\n")
#> === CRÉATION DE L'OBJET DESeq2 ===
# Vérifier que sample_type est un facteur
if(!is.factor(metadata$condition)) {
  metadata$condition <- factor(metadata$condition)
}

# Définir le niveau de référence (Normal en premier)
metadata$sample_type <- relevel(metadata$condition, ref = "Normal")

cat("Niveaux du facteur sample_type:", levels(metadata$condition), "\n")
#> Niveaux du facteur sample_type: Normal Tumor
cat("Niveau de référence:", levels(metadata$condition)[1], "\n\n")
#> Niveau de référence: Normal
# Créer l'objet DESeqDataSet
dds <- DESeqDataSetFromMatrix(
  countData = counts_filtered,
  colData = metadata,
  design = ~ condition
)

cat("Objet DESeqDataSet créé!\n\n")
#> Objet DESeqDataSet créé!
# Informations
cat("=== INFORMATIONS OBJET DDS ===\n")
#> === INFORMATIONS OBJET DDS ===
print(dds)
#> class: DESeqDataSet 
#> dim: 16023 327 
#> metadata(1): version
#> assays(1): counts
#> rownames(16023): ARHGEF10L HIF3A ... SELP SELS
#> rowData names(0):
#> colnames(327): TCGA-CA-5256-01 TCGA-AZ-6599-01 ... TCGA-CA-5796-01
#>   TCGA-G4-6626-01
#> colData names(5): sample_id barcode condition patient_id sample_type

6.2 Normalisation et Transformation

6.2.1 Transformation VST (Variance Stabilizing Transformation)

** POURQUOI VST?**

RNA-seq a une variance dépendant de la moyenne : - Gènes peu exprimés → variance élevée - Gènes très exprimés → variance faible

VST stabilise la variance pour permettre : - PCA - Clustering hiérarchique - Heatmaps - Visualisations

IMPORTANT : VST est pour VISUALISATION uniquement!
L’analyse différentielle utilise les comptages BRUTS.

cat("=== TRANSFORMATION VST ===\n")
#> === TRANSFORMATION VST ===
cat("Cette étape peut prendre quelques minutes...\n\n")
#> Cette étape peut prendre quelques minutes...
# VST avec blind=TRUE (ignore le design expérimental)
vst_data <- vst(dds, blind = TRUE)

cat("Transformation VST terminée!\n\n")
#> Transformation VST terminée!
# Extraire la matrice transformée
vst_counts <- assay(vst_data)

cat("Dimensions données VST:", dim(vst_counts), "\n")
#> Dimensions données VST: 16023 327
cat("Range des valeurs VST:", round(range(vst_counts), 2), "\n")
#> Range des valeurs VST: 5.39 20.43

QUESTION 3a : Pourquoi utilise-t-on une transformation VST au lieu des comptages bruts pour la PCA?

QUESTION 3b : Que signifie le paramètre blind = TRUE?


7 Analyse en Composantes Principales (PCA)

7.1 PCA sur Données VST

cat("=== ANALYSE PCA ===\n\n")
#> === ANALYSE PCA ===
# Calculer la PCA (transposer car PCA veut échantillons en lignes)
pca_result <- prcomp(t(vst_counts), center = TRUE, scale. = FALSE)

# Variance expliquée
var_explained <- round(100 * (pca_result$sdev^2 / sum(pca_result$sdev^2)), 2)

cat("Variance expliquée par les 10 premières PC:\n")
#> Variance expliquée par les 10 premières PC:
print(var_explained[1:10])
#>  [1] 20.26 11.06  6.99  3.30  2.89  2.48  2.12  1.96  1.69  1.41
cat("\n")
cat("PC1 + PC2:", var_explained[1] + var_explained[2], "%\n")
#> PC1 + PC2: 31.32 %
cat("PC1 + PC2 + PC3:", sum(var_explained[1:3]), "%\n")
#> PC1 + PC2 + PC3: 38.31 %
# Créer dataframe pour ggplot
pca_data <- data.frame(
  Sample = rownames(pca_result$x),
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  PC3 = pca_result$x[, 3],
  PC4 = pca_result$x[, 4],
  Sample_Type = metadata$condition,
  Library_Size = colSums(counts_filtered)
)

7.2 Visualisation PCA: PC1 vs PC2

# Plot PC1 vs PC2
p_pca <- ggplot(pca_data, aes(x = PC1, y = PC2, color = Sample_Type, shape = Sample_Type)) +
  geom_point(size = 4, alpha = 0.7) +
  scale_color_manual(values = c("Normal" = "steelblue", "Tumor" = "coral")) +
  scale_shape_manual(values = c("Normal" = 16, "Tumor" = 17)) +
  labs(
    title = "PCA: Tumeurs vs Tissus Normaux",
    subtitle = paste0("PC1 (", var_explained[1], "%) vs PC2 (", var_explained[2], "%)"),
    x = paste0("PC1 (", var_explained[1], "% de variance)"),
    y = paste0("PC2 (", var_explained[2], "% de variance)"),
    color = "Type d'échantillon",
    shape = "Type d'échantillon"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "right",
    panel.grid.major = element_line(color = "gray90"),
    panel.border = element_rect(color = "black", fill = NA)
  ) +
  # Ajouter ellipses de confiance
  stat_ellipse(aes(fill = Sample_Type), alpha = 0.2, geom = "polygon", level = 0.95) +
  scale_fill_manual(values = c("Normal" = "steelblue", "Tumor" = "coral"), guide = "none")

print(p_pca)

ggsave("figures/QC/PCA_PC1_PC2.pdf", p_pca, width = 12, height = 10)
cat("Figure sauvegardée: figures/QC/PCA_PC1_PC2.pdf\n")
#> Figure sauvegardée: figures/QC/PCA_PC1_PC2.pdf

QUESTION 4a : Les tumeurs et les normaux sont-ils bien séparés sur la PCA?

QUESTION 4b : Sur quelle composante principale observe-t-on la meilleure séparation?

QUESTION 4c : Que représente la variance expliquée par PC1 et PC2?

QUESTION 4d : Y a-t-il des échantillons outliers visibles?


7.3 PCA: Autres Combinaisons de PCs

# PC1 vs PC3
p_pc1_pc3 <- ggplot(pca_data, aes(x = PC1, y = PC3, color = Sample_Type)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_manual(values = c("Normal" = "steelblue", "Tumor" = "coral")) +
  labs(
    title = paste0("PC1 (", var_explained[1], "%) vs PC3 (", var_explained[3], "%)"),
    x = paste0("PC1 (", var_explained[1], "%)"),
    y = paste0("PC3 (", var_explained[3], "%)")
  ) +
  theme_minimal()

# PC2 vs PC3
p_pc2_pc3 <- ggplot(pca_data, aes(x = PC2, y = PC3, color = Sample_Type)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_manual(values = c("Normal" = "steelblue", "Tumor" = "coral")) +
  labs(
    title = paste0("PC2 (", var_explained[2], "%) vs PC3 (", var_explained[3], "%)"),
    x = paste0("PC2 (", var_explained[2], "%)"),
    y = paste0("PC3 (", var_explained[3], "%)")
  ) +
  theme_minimal()

# Combiner avec patchwork
p_combined_pca <- p_pc1_pc3 / p_pc2_pc3 + 
  plot_layout(guides = "collect") & 
  theme(legend.position = "bottom")

print(p_combined_pca)

ggsave("figures/QC/PCA_other_combinations.pdf", p_combined_pca, width = 12, height = 12)

7.4 Scree Plot: Variance Expliquée

# Dataframe pour scree plot
df_scree <- data.frame(
  PC = paste0("PC", 1:20),
  Variance = var_explained[1:20],
  PC_num = 1:20
)

# Scree plot
p_scree <- ggplot(df_scree, aes(x = PC_num, y = Variance)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color = "steelblue", size = 3) +
  geom_text(aes(label = paste0(round(Variance, 1), "%")), 
            vjust = -0.5, size = 3) +
  scale_x_continuous(breaks = 1:20, labels = df_scree$PC) +
  labs(
    title = "Scree Plot: Variance Expliquée par Composante Principale",
    x = "Composante Principale",
    y = "Variance Expliquée (%)"
  ) +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p_scree)

ggsave("figures/QC/PCA_scree_plot.pdf", p_scree, width = 10, height = 6)

8 Clustering Hiérarchique

8.1 Matrice de Distance

cat("=== CALCUL MATRICE DE DISTANCE ===\n\n")
#> === CALCUL MATRICE DE DISTANCE ===
# Calculer distances euclidiennes entre échantillons
sample_dists <- dist(t(vst_counts), method = "euclidean")

# Convertir en matrice pour heatmap
sample_dist_matrix <- as.matrix(sample_dists)

cat("Dimensions matrice de distance:", dim(sample_dist_matrix), "\n")
#> Dimensions matrice de distance: 327 327
cat("Distance minimum:", round(min(sample_dist_matrix[sample_dist_matrix > 0]), 2), "\n")
#> Distance minimum: 34.36
cat("Distance maximum:", round(max(sample_dist_matrix), 2), "\n")
#> Distance maximum: 218.46

8.2 Heatmap des Distances Entre Échantillons

# Annotations pour la heatmap
annotation_col <- data.frame(
  Type = metadata$condition,
  row.names = rownames(metadata)
)

# Couleurs annotations
annotation_colors <- list(
  Type = c(Normal = "steelblue", Tumor = "coral")
)

# Heatmap avec pheatmap
p_dist <- pheatmap(
  sample_dist_matrix,
  clustering_distance_rows = sample_dists,
  clustering_distance_cols = sample_dists,
  annotation_col = annotation_col,
  annotation_row = annotation_col,
  annotation_colors = annotation_colors,
  color = viridis(100, option = "C"),
  main = "Distances Entre Échantillons (VST)",
  fontsize = 8,
  show_rownames = FALSE,
  show_colnames = FALSE
)

p_dist

pdf("figures/QC/sample_distance_heatmap.pdf", width = 12, height = 12)
grid::grid.draw(p_dist$gtable)
dev.off()
#> png 
#>   2
cat("Heatmap sauvegardée: figures/QC/sample_distance_heatmap.pdf\n")
#> Heatmap sauvegardée: figures/QC/sample_distance_heatmap.pdf

8.3 Dendrogramme

library(dendextend)

# Factoriser les conditions
metadata$condition <- factor(metadata$condition, levels = c("Normal", "Tumor"))
group_colors <- c("Normal" = "steelblue", "Tumor" = "coral")

# Séparer les matrices par condition
counts_normal <- counts_matrix[, metadata$condition == "Normal"]
counts_tumor  <- counts_matrix[, metadata$condition == "Tumor"]

metadata_normal <- metadata[metadata$condition == "Normal", ]
metadata_tumor  <- metadata[metadata$condition == "Tumor", ]

# ========================================
# Dendrogramme Normal
# ========================================
sample_dists_normal <- dist(t(counts_normal))
hc_normal <- hclust(sample_dists_normal, method = "ward.D2")
dend_normal <- as.dendrogram(hc_normal)

leaf_colors_normal <- rep(group_colors["Normal"], ncol(counts_normal))
labels_colors(dend_normal) <- leaf_colors_normal

# Affichage dans le notebook
par(mar = c(5,4,4,2))
plot(dend_normal,
     main = "Dendrogramme - Normal",
     ylab = "Distance",
     leaflab = "none")
colored_bars(colors = leaf_colors_normal, dend = dend_normal, rowLabels = "Condition")

pdf("figures/QC/dendrogram_normal.pdf", width = 14, height = 8)
par(mar = c(5,4,4,2))
plot(dend_normal,
     main = "Dendrogramme - Normal",
     ylab = "Distance",
     leaflab = "none")
colored_bars(colors = leaf_colors_normal, dend = dend_normal, rowLabels = "Condition")
dev.off()
#> png 
#>   2
cat("Dendrogramme Normal sauvegardé: figures/QC/dendrogram_normal.pdf\n")
#> Dendrogramme Normal sauvegardé: figures/QC/dendrogram_normal.pdf
# ========================================
# Dendrogramme Tumor
# ========================================
sample_dists_tumor <- dist(t(counts_tumor))
hc_tumor <- hclust(sample_dists_tumor, method = "ward.D2")
dend_tumor <- as.dendrogram(hc_tumor)

leaf_colors_tumor <- rep(group_colors["Tumor"], ncol(counts_tumor))
labels_colors(dend_tumor) <- leaf_colors_tumor

# Affichage
par(mar = c(5,4,4,2))
plot(dend_tumor,
     main = "Dendrogramme - Tumor",
     ylab = "Distance",
     leaflab = "none")
colored_bars(colors = leaf_colors_tumor, dend = dend_tumor, rowLabels = "Condition")

pdf("figures/QC/dendrogram_tumor.pdf", width = 14, height = 8)
par(mar = c(5,4,4,2))
plot(dend_tumor,
     main = "Dendrogramme - Tumor",
     ylab = "Distance",
     leaflab = "none")
colored_bars(colors = leaf_colors_tumor, dend = dend_tumor, rowLabels = "Condition")
dev.off()
#> png 
#>   2
cat("Dendrogramme Tumor sauvegardé: figures/QC/dendrogram_tumor.pdf\n")
#> Dendrogramme Tumor sauvegardé: figures/QC/dendrogram_tumor.pdf

QUESTION 5a : Les échantillons tumoraux se regroupent-ils ensemble?

QUESTION 5b : Même question pour les normaux?

QUESTION 5c : Identifiez-vous des sous-groupes parmi les tumeurs?

QUESTION 5d : Comment expliquez-vous cette hétérogénéité?


9 Analyse Différentielle avec DESeq2

9.1 Exécution de DESeq2

ÉTAPE CRITIQUE

DESeq2 effectue 3 étapes : 1. Estimation des facteurs de taille (normalisation) 2. Estimation de la dispersion (variance par gène) 3. Tests statistiques (Wald test ou LRT)

Cette étape peut prendre 5-15 minutes selon la taille du dataset et votre ordinateur.

cat("=== EXÉCUTION DE DESeq2 ===\n")
#> === EXÉCUTION DE DESeq2 ===
cat("Cette étape peut prendre plusieurs minutes...\n\n")
#> Cette étape peut prendre plusieurs minutes...
# Exécuter l'analyse complète
dds <- DESeq(dds)

cat("Annalyse DESeq2 terminée!\n\n")
#> Annalyse DESeq2 terminée!
# Afficher les coefficients du modèle
cat("Coefficients du modèle:\n")
#> Coefficients du modèle:
resultsNames(dds)
#> [1] "Intercept"                 "condition_Tumor_vs_Normal"

9.2 Extraction des Résultats

cat("=== EXTRACTION DES RÉSULTATS ===\n\n")
#> === EXTRACTION DES RÉSULTATS ===
# Extraire les résultats (comparaison Tumor vs Normal)
res <- results(dds, 
               contrast = c("condition", "Tumor", "Normal"),
               alpha = 0.05)  # Seuil FDR

cat("Comparaison effectuée: Tumor vs Normal\n")
#> Comparaison effectuée: Tumor vs Normal
cat("Seuil FDR (alpha):", 0.05, "\n\n")
#> Seuil FDR (alpha): 0.05
# Résumé
cat("=== RÉSUMÉ DES RÉSULTATS ===\n")
#> === RÉSUMÉ DES RÉSULTATS ===
summary(res)
#> 
#> out of 16023 with nonzero total read count
#> adjusted p-value < 0.05
#> LFC > 0 (up)       : 6414, 40%
#> LFC < 0 (down)     : 5951, 37%
#> outliers [1]       : 0, 0%
#> low counts [2]     : 0, 0%
#> (mean count < 3)
#> [1] see 'cooksCutoff' argument of ?results
#> [2] see 'independentFiltering' argument of ?results
# Ordonner par p-value ajustée
res_ordered <- res[order(res$padj), ]

cat("\n=== TOP 10 GÈNES (padj le plus faible) ===\n")
#> 
#> === TOP 10 GÈNES (padj le plus faible) ===
head(as.data.frame(res_ordered), 10)

9.3 Filtrage des Gènes Significatifs

cat("=== FILTRAGE GÈNES DIFFÉRENTIELLEMENT EXPRIMÉS ===\n\n")
#> === FILTRAGE GÈNES DIFFÉRENTIELLEMENT EXPRIMÉS ===
# Critères: padj < 0.05 ET |log2FC| > 1
threshold_padj <- 0.05
threshold_log2fc <- 1

# Gènes significatifs
res_sig <- res[which(res$padj < threshold_padj & abs(res$log2FoldChange) > threshold_log2fc), ]
res_sig <- res_sig[!is.na(res_sig$padj), ]  # Enlever les NA

cat("Critères:\n")
#> Critères:
cat("  - padj <", threshold_padj, "\n")
#>   - padj < 0.05
cat("  - |log2FC| >", threshold_log2fc, "\n\n")
#>   - |log2FC| > 1
cat("Gènes différentiellement exprimés:", nrow(res_sig), "\n")
#> Gènes différentiellement exprimés: 4316
# Sur-exprimés vs sous-exprimés
up_regulated <- res_sig[res_sig$log2FoldChange > threshold_log2fc, ]
down_regulated <- res_sig[res_sig$log2FoldChange < -threshold_log2fc, ]

cat("\nSur-exprimés (Tumor > Normal):", nrow(up_regulated), "\n")
#> 
#> Sur-exprimés (Tumor > Normal): 1994
cat("Sous-exprimés (Tumor < Normal):", nrow(down_regulated), "\n")
#> Sous-exprimés (Tumor < Normal): 2322
# Ratio
ratio <- round(nrow(up_regulated) / nrow(down_regulated), 2)
cat("\nRatio Up/Down:", ratio, "\n")
#> 
#> Ratio Up/Down: 0.86

QUESTION 6a : Combien de gènes sont différentiellement exprimés au total?

QUESTION 6b : Observe-t-on plus de gènes sur-exprimés ou sous-exprimés?

QUESTION 6c : Que suggère ce déséquilibre sur la biologie du cancer colorectal?


9.4 Conversion en DataFrame et Ajout d’Annotations

# Convertir en dataframe
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)

# Ajouter annotations (symboles de gènes)
library(AnnotationDbi)
library(org.Hs.eg.db)

# Mapper Ensembl IDs vers symboles
res_df$symbol <- mapIds(org.Hs.eg.db,
                        keys = res_df$gene,
                        column = "SYMBOL",
                        keytype = "SYMBOL",
                        multiVals = "first")

# Ajouter nom complet du gène
res_df$genename <- mapIds(org.Hs.eg.db,
                          keys = res_df$gene,
                          column = "GENENAME",
                          keytype = "SYMBOL",
                          multiVals = "first")

# Aperçu
cat("=== DATAFRAME AVEC ANNOTATIONS ===\n")
#> === DATAFRAME AVEC ANNOTATIONS ===
head(res_df[, c("gene", "symbol", "genename", "log2FoldChange", "padj")], 10)
# Même chose pour les gènes significatifs
res_sig_df <- as.data.frame(res_sig)
res_sig_df$gene <- rownames(res_sig_df)
res_sig_df$symbol <- mapIds(org.Hs.eg.db,
                            keys = res_sig_df$gene,
                            column = "SYMBOL",
                            keytype = "SYMBOL",
                            multiVals = "first")

9.5 Sauvegarder les Résultats

# Sauvegarder tous les résultats
write.csv(res_df, 
          "results/tables/DESeq2_results_all_genes.csv",
          row.names = FALSE)

# Sauvegarder les gènes significatifs
write.csv(res_sig_df,
          "results/tables/DESeq2_results_significant_genes.csv",
          row.names = FALSE)

# Sauvegarder top up/down
write.csv(as.data.frame(up_regulated),
          "results/tables/DESeq2_upregulated_genes.csv",
          row.names = TRUE)

write.csv(as.data.frame(down_regulated),
          "results/tables/DESeq2_downregulated_genes.csv",
          row.names = TRUE)

cat("Résultats sauvegardés dans results/tables/\n")
#> Résultats sauvegardés dans results/tables/

10 Visualisations des Résultats

10.1 MA Plot

MA PLOT

  • Axe X : Moyenne des comptages normalisés (log scale) = baseMean
  • Axe Y : log2 Fold Change
  • Points rouges : Significatifs (padj < 0.05)
  • Points gris : Non significatifs

Permet de voir si il y a un biais lié au niveau d’expression.

# Préparer les données
res_df <- as.data.frame(res) %>%
    mutate(
        regulation = case_when(
            padj < 0.05 & log2FoldChange > 1 ~ "Up",
            padj < 0.05 & log2FoldChange < -1 ~ "Down",
            TRUE ~ "NS"
        )
    )

# Créer le MA plot
p_ma <- ggplot(res_df, aes(x = baseMean, y = log2FoldChange, color = regulation)) +
    geom_point(alpha = 0.4, size = 0.8) +
    scale_x_log10(
        labels = scales::scientific,
        breaks = scales::trans_breaks("log10", function(x) 10^x)
    ) +
    scale_color_manual(
        values = c("Up" = "#E41A1C", "Down" = "#377EB8", "NS" = "grey60"),
        labels = c(
            "Up" = paste0("Sur-exprimés (", nrow(up_regulated), ")"),
            "Down" = paste0("Sous-exprimés (", nrow(down_regulated), ")"),
            "NS" = "Non significatif"
        ),
        name = "Régulation"
    ) +
    geom_hline(yintercept = c(-1, 1), linetype = "dashed", 
               color = "black", size = 0.5) +
    geom_hline(yintercept = 0, linetype = "solid", 
               color = "black", size = 0.3) +
    labs(
        title = "MA Plot - Analyse Différentielle",
        subtitle = "Cancer Colorectal TCGA-COAD (FDR < 0.05, |log2FC| > 1)",
        x = "Expression Moyenne (log10)",
        y = "Log2 Fold Change (Tumor / Normal)"
    ) +
    theme_minimal(base_size = 12) +
    theme(
        legend.position = "right",
        plot.title = element_text(face = "bold", size = 14),
        axis.title = element_text(face = "bold"),
        panel.grid.minor = element_blank()
    )

print(p_ma)

# Sauvegarder
ggsave("figures/DEG/MA_plot.pdf", p_ma, width = 12, height = 8)

QUESTION 7a : Les gènes très fortement différentiellement exprimés ont-ils tendance
à être fortement ou faiblement exprimés?

QUESTION 7b : Pourquoi y a-t-il un “entonnoir” (variance plus faible)
pour les gènes très exprimés?

QUESTION 7c : Identifiez-vous un biais technique potentiel?


10.2 Volcano Plot (EnhancedVolcano)

cat("=== VOLCANO PLOT ===\n\n")
#> === VOLCANO PLOT ===
# S'assurer que les p-values ne sont pas nulles
res$padj[res$padj == 0] <- min(res$padj[res$padj > 0]) * 1e-1

# Top gènes à annoter : up et down
top20_up <- res[order(res$log2FoldChange, decreasing = TRUE), ][1:20, ]
top20_down <- res[order(res$log2FoldChange, decreasing = FALSE), ][1:20, ]

genes_to_label <- c(
    rownames(top20_up)[1:8],     # Top 8 up
    rownames(top20_down)[1:8]    # Top 8 down
)

# Volcano plot avec EnhancedVolcano
p_volcano <- EnhancedVolcano(
    res,
    lab = rownames(res),
    x = 'log2FoldChange',
    y = 'padj',
    selectLab = genes_to_label,
    title = 'Volcano Plot - Analyse Différentielle',
    subtitle = 'Cancer Colorectal TCGA-COAD (Tumor vs Normal)',
    pCutoff = 0.05,
    FCcutoff = 1,
    pointSize = 2.5,
    labSize = 5.0,
    labCol = 'black',
    labFace = 'bold',
    boxedLabels = TRUE,
    parseLabels = FALSE,
    col = c('grey30', 'forestgreen', 'royalblue', 'red2'),  # couleurs par catégorie
    colAlpha = 0.6,
    legendLabels = c(
        'NS',
        expression(log[2]~FC),
        'p-value',
        expression(p-value~and~log[2]~FC)
    ),
    legendPosition = 'right',
    legendLabSize = 12,
    legendIconSize = 4.0,
    drawConnectors = TRUE,
    widthConnectors = 0.5,
    colConnectors = 'grey30',
    arrowheads = FALSE,
    gridlines.major = TRUE,
    gridlines.minor = FALSE,
    border = 'partial',
    borderWidth = 1.5,
    borderColour = 'black',
    max.overlaps = 20
)

print(p_volcano)

# Sauvegarder en PDF
ggsave("figures/DEG/volcano_plot.pdf", p_volcano, width = 14, height = 10)
cat("✓ Volcano plot sauvegardé : figures/DEG/volcano_plot.pdf\n")
#> ✓ Volcano plot sauvegardé : figures/DEG/volcano_plot.pdf

QUESTION 8a : Identifiez le gène avec le log2FC le plus élevé. Quelle est sa fonction?

QUESTION 8b : Identifiez le gène avec la p-value ajustée la plus faible.

QUESTION 8c : Quels gènes se situent dans le “coin supérieur droit”
(fort FC + très significatif)?

QUESTION 8d : Ces gènes sont-ils cohérents avec la biologie du CRC?


10.3 Heatmap des Gènes Significatifs

10.3.1 Top 50 Gènes par padj

# Sélectionner top 50 gènes
top_genes <- head(rownames(res_sig), 50)

# Extraire comptages VST
top_counts <- vst_counts[top_genes, ]

# Z-score (centrer et réduire par gène)
top_counts_scaled <- t(scale(t(top_counts)))

# Remplacer Ensembl IDs par symboles pour labels
gene_labels <- res_df$symbol[match(rownames(top_counts_scaled), res_df$gene)]
gene_labels[is.na(gene_labels)] <- rownames(top_counts_scaled)[is.na(gene_labels)]
rownames(top_counts_scaled) <- gene_labels

# Annotations colonnes
annotation_col <- data.frame(
  Type = metadata$condition,
  row.names = colnames(top_counts_scaled)
)

annotation_colors <- list(
  Type = c(Normal = "steelblue", Tumor = "coral")
)

pdf("figures/DEG/heatmap_top50_genes.pdf", width = 12, height = 14)
pheatmap(
  top_counts_scaled,
  annotation_col = annotation_col,
  annotation_colors = annotation_colors,
  color = colorRampPalette(c("blue", "white", "red"))(100),
  scale = "none",
  clustering_distance_rows = "euclidean",
  clustering_distance_cols = "euclidean",
  clustering_method = "ward.D2",
  show_rownames = TRUE,
  show_colnames = FALSE,
  fontsize_row = 8,
  main = "Top 50 Gènes Différentiellement Exprimés (Z-score)"
)

dev.off()
#> pdf 
#>   3

QUESTION 9a : Observe-t-on un regroupement clair Tumor vs Normal?

QUESTION 9b : Certains gènes ont-ils des patterns d’expression particuliers?

QUESTION 9c : Y a-t-il des sous-groupes de tumeurs visibles?

QUESTION 9d : Comment interpréter les valeurs du Z-score (bleu vs rouge)?


10.3.2 Heatmap Top Up et Top Down

# Top 25 up et top 25 down
up_top <- head(rownames(up_regulated[order(up_regulated$log2FoldChange, decreasing = TRUE), ]), 25)
down_top <- head(rownames(down_regulated[order(down_regulated$log2FoldChange, decreasing = FALSE), ]), 25)

selected_genes <- c(up_top, down_top)

# Extraire comptages
selected_counts <- vst_counts[selected_genes, ]
selected_counts_scaled <- t(scale(t(selected_counts)))

# Labels
gene_labels <- res_df$symbol[match(rownames(selected_counts_scaled), res_df$gene)]
gene_labels[is.na(gene_labels)] <- rownames(selected_counts_scaled)[is.na(gene_labels)]
rownames(selected_counts_scaled) <- gene_labels

pdf("figures/DEG/heatmap_top_up_down.pdf", width = 14, height = 16)
pheatmap(
  selected_counts_scaled,
  annotation_col = annotation_col,
  annotation_colors = annotation_colors,
  color = colorRampPalette(c("blue", "white", "red"))(100),
  scale = "none",
  clustering_distance_rows = "euclidean",
  clustering_distance_cols = "euclidean",
  clustering_method = "ward.D2",
  show_rownames = TRUE,
  show_colnames = FALSE,
  fontsize_row = 7,
  main = "Top 25 Up et Top 25 Down Regulated Genes"
)

dev.off()
#> pdf 
#>   3

10.4 Count Plots de Gènes Individuels

10.4.1 Top 5 Gènes Sur-exprimés et sous-exprimés

# Top 5 gènes up‑régulés
top5_up <- head(rownames(up_regulated[order(up_regulated$log2FoldChange, decreasing = TRUE), ]), 5)

# Top 5 gènes down‑régulés
top5_down <- head(rownames(down_regulated[order(down_regulated$log2FoldChange, decreasing = FALSE), ]), 5)


plot_gene_counts <- function(gene_id, dds, metadata) {
  
  # Extraire comptages normalisés
  counts_norm <- counts(dds, normalized = TRUE)[gene_id, ]
  
  # Récupérer symbole du gène
  symbol <- res_df$symbol[res_df$gene == gene_id]
  if(length(symbol) == 0 || is.na(symbol)) symbol <- gene_id
  
  # log2FC et padj
  log2fc <- res[gene_id, "log2FoldChange"]
  padj <- res[gene_id, "padj"]
  
  padj_label <- ifelse(
  is.na(padj),
  "NA",
  ifelse(padj < 1e-10,
         "< 1e-10",
         formatC(padj, format = "e", digits = 2))
  )
  
  # Créer dataframe pour ggplot
  df_plot <- data.frame(
    Sample = names(counts_norm),
    Counts = counts_norm,
    Type = metadata$condition
  )
  
  # Créer plot
  p <- ggplot(df_plot, aes(x = Type, y = Counts, fill = Type)) +
    geom_boxplot(outlier.shape = NA, alpha = 0.7) +
    geom_jitter(width = 0.2, size = 1, alpha = 0.5) +
    scale_fill_manual(values = c("Normal" = "steelblue", "Tumor" = "coral")) +
    labs(
      title = paste0(symbol, " (", gene_id, ")"),
      subtitle = paste0("log2FC = ", round(log2fc, 2),
                  ", padj = ", padj_label),
      y = "Normalized Counts"
    ) +
    theme_minimal() +
    theme(legend.position = "none")
  
  return(p)
}


# top5 up

plots_up <- lapply(top5_up, plot_gene_counts, dds = dds, metadata = metadata)
combined_up <- wrap_plots(plots_up, ncol = 3)

print(combined_up)

ggsave("figures/DEG/count_plots_top5_upregulated.pdf", combined_up, width = 14, height = 10)
cat("✅ Top 5 up‑regulated saved: figures/DEG/count_plots_top5_upregulated.pdf\n")
#> ✅ Top 5 up‑regulated saved: figures/DEG/count_plots_top5_upregulated.pdf
# -------------------------------
# Créer plots pour top5 down
# -------------------------------
plots_down <- lapply(top5_down, plot_gene_counts, dds = dds, metadata = metadata)
combined_down <- wrap_plots(plots_down, ncol = 3)

print(combined_down)

ggsave("figures/DEG/count_plots_top5_downregulated.pdf", combined_down, width = 14, height = 10)
cat("✅ Top 5 down‑regulated saved: figures/DEG/count_plots_top5_downregulated.pdf\n")
#> ✅ Top 5 down‑regulated saved: figures/DEG/count_plots_top5_downregulated.pdf

11 Enrichissement Fonctionnel

11.1 Préparation des Listes de Gènes

cat("=== PRÉPARATION DES LISTES DE GÈNES ===\n\n")
#> === PRÉPARATION DES LISTES DE GÈNES ===
cat("Exemple de rownames(res_sig):", head(rownames(res_sig)), "\n")
#> Exemple de rownames(res_sig): HIF3A RTN4RL2 CAMK4 RNF112 SPN LRRTM1
# Utiliser SYMBOL comme keytype
genes_all <- rownames(res_sig)
genes_entrez <- mapIds(org.Hs.eg.db,
                       keys = genes_all,
                       column = "ENTREZID",
                       keytype = "SYMBOL",
                       multiVals = "first")

# Enlever NA
genes_entrez <- genes_entrez[!is.na(genes_entrez)]

cat("Gènes significatifs (SYMBOL):", length(genes_all), "\n")
#> Gènes significatifs (SYMBOL): 4316
cat("Gènes avec Entrez ID:", length(genes_entrez), "\n")
#> Gènes avec Entrez ID: 3757
# Liste UP
genes_up <- rownames(up_regulated)
genes_up_entrez <- mapIds(org.Hs.eg.db,
                          keys = genes_up,
                          column = "ENTREZID",
                          keytype = "SYMBOL",
                          multiVals = "first")
genes_up_entrez <- genes_up_entrez[!is.na(genes_up_entrez)]
cat("\nGènes UP (Entrez):", length(genes_up_entrez), "\n")
#> 
#> Gènes UP (Entrez): 1702
# Liste DOWN
genes_down <- rownames(down_regulated)
genes_down_entrez <- mapIds(org.Hs.eg.db,
                            keys = genes_down,
                            column = "ENTREZID",
                            keytype = "SYMBOL",
                            multiVals = "first")
genes_down_entrez <- genes_down_entrez[!is.na(genes_down_entrez)]
cat("Gènes DOWN (Entrez):", length(genes_down_entrez), "\n")
#> Gènes DOWN (Entrez): 2055

11.2 Gene Ontology (GO) Enrichment - Biological Process

cat("=== GO ENRICHMENT: BIOLOGICAL PROCESS ===\n\n")
#> === GO ENRICHMENT: BIOLOGICAL PROCESS ===
# Over-Representation Analysis (ORA)
ego_bp <- enrichGO(
  gene = genes_entrez,
  OrgDb = org.Hs.eg.db,
  ont = "BP",  # Biological Process
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  qvalueCutoff = 0.05,
  readable = TRUE  # Convertir en symboles
)

cat("Termes GO-BP enrichis:", nrow(ego_bp), "\n\n")
#> Termes GO-BP enrichis: 1520
if(nrow(ego_bp) > 0) {
  # Aperçu
  cat("Top 10 termes GO-BP:\n")
  print(head(as.data.frame(ego_bp), 10))
  
  # Dotplot
  p_go_bp <- dotplot(ego_bp, showCategory = 20) +
    ggtitle("GO Biological Process Enrichment") +
    theme(axis.text.y = element_text(size = 10))
  
  print(p_go_bp)
  
  ggsave("figures/enrichment/GO_BP_dotplot.pdf", p_go_bp, width = 12, height = 10)
  
  # Barplot
  p_go_bp_bar <- barplot(ego_bp, showCategory = 15) +
    ggtitle("Top 15 GO-BP Terms")
  
  print(p_go_bp_bar)
  
  ggsave("figures/enrichment/GO_BP_barplot.pdf", p_go_bp_bar, width = 12, height = 10)
  
  # Sauvegarder résultats
  write.csv(as.data.frame(ego_bp), 
            "results/enrichment/GO_BP_enrichment.csv",
            row.names = FALSE)
}
#> Top 10 termes GO-BP:
#>                    ID                                          Description
#> GO:0006936 GO:0006936                                   muscle contraction
#> GO:0003012 GO:0003012                                muscle system process
#> GO:0003018 GO:0003018               vascular process in circulatory system
#> GO:0030198 GO:0030198                    extracellular matrix organization
#> GO:0043062 GO:0043062                 extracellular structure organization
#> GO:0045229 GO:0045229        external encapsulating structure organization
#> GO:0050900 GO:0050900                                  leukocyte migration
#> GO:0034765 GO:0034765 regulation of monoatomic ion transmembrane transport
#> GO:0006935 GO:0006935                                           chemotaxis
#> GO:0042330 GO:0042330                                                taxis
#>            GeneRatio   BgRatio                          pvalue
#> GO:0006936  143/3449 351/18870 0.00000000000000000000003718072
#> GO:0003012  168/3449 452/18870 0.00000000000000000000070843054
#> GO:0003018  113/3449 269/18870 0.00000000000000000008550623035
#> GO:0030198  125/3449 321/18870 0.00000000000000000177632895889
#> GO:0043062  125/3449 322/18870 0.00000000000000000238938120244
#> GO:0045229  125/3449 323/18870 0.00000000000000000320769496563
#> GO:0050900  140/3449 396/18870 0.00000000000000027147730096656
#> GO:0034765  153/3449 454/18870 0.00000000000000138541385842614
#> GO:0006935  155/3449 468/18870 0.00000000000000494992456710502
#> GO:0042330  155/3449 470/18870 0.00000000000000747246729934852
#>                               p.adjust                      qvalue
#> GO:0006936 0.0000000000000000002307436 0.0000000000000000001585856
#> GO:0003012 0.0000000000000000021982600 0.0000000000000000015108213
#> GO:0003018 0.0000000000000001768838885 0.0000000000000001215688580
#> GO:0030198 0.0000000000000027559743797 0.0000000000000018941276162
#> GO:0043062 0.0000000000000029656999485 0.0000000000000020382679226
#> GO:0045229 0.0000000000000033178258261 0.0000000000000022802771931
#> GO:0050900 0.0000000000002406840185426 0.0000000000001654174471453
#> GO:0034765 0.0000000000010747348006741 0.0000000000007386443360977
#> GO:0006935 0.0000000000034132479848282 0.0000000000023458589878257
#> GO:0042330 0.0000000000046374132059757 0.0000000000031872039470484
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               geneID
#> GO:0006936                                                                                                                                                       ATP2A1/ITGA2/ATP1A2/SPHK1/CLIC2/TMOD2/CLCN1/DMD/SRI/NMU/PTGS2/SCN2B/SGCA/GJC1/AKAP9/TRPV4/ADRA2C/CCDC78/FKBP1B/TNNI3K/EDNRA/EDNRB/PPP1R13L/ARHGAP42/TNNT1/CAV1/KCNA5/STC1/DES/SCN3B/DTNA/DMPK/SLMAP/KIT/GPD1L/STAC/PIK3CG/TPM2/CALCA/F2R/GNAO1/CKMT2/RGS2/MYH11/TNNI3/CACNB2/CACNB1/OXTR/SCN1B/PLCE1/CHRM2/HTR1D/SCNN1B/SNTB1/PPP1R12B/SCN7A/FGF13/NOS1/JSRP1/SLC9A1/CNN1/TRPM4/ADRA2A/SLC8A3/CACNA2D1/ATP2B4/CHRNA1/CHRNA3/TACR2/TACR1/UTS2/CACNA1H/CACNA1C/CACNA1D/TMOD1/KCNJ3/RCSD1/SSPN/P2RX1/P2RX4/TBX20/ADRA1B/MYOCD/COMP/GSTM2/DSC2/GRIP2/PRKG1/LMOD1/LMOD3/TPM1/NEDD4L/CALD1/DRD2/APBB1/GATA4/TNNT2/SSTR2/LTB4R/MYOT/KCND3/SMTN/PDE5A/SCN11A/BDKRB2/HOMER1/CD38/ADRB2/FXYD1/SMPX/SULF1/HRC/SLC8A1/SCN4A/SCN4B/ACTA2/PLN/PTGER3/ANK2/TNNC1/GSN/TIFAB/SLC6A8/CASQ2/HTR7/KCNMA1/FLNA/NMUR1/MYL6/MYL9/MYLK/SYNM/CHRNB2/CHRNB4/CHGA/PDE4D/MAP2K6/ABCC9/RYR1/RYR3/EDN3/EDN2/CRYAB
#> GO:0003012 ATP2A1/ITGA2/ATP1A2/SPHK1/CLIC2/TMOD2/CLCN1/DMD/SRI/NMU/PTGS2/SCN2B/SGCA/GJC1/IL6ST/AKAP6/AKAP9/TRPV4/ADRA2C/CCDC78/FKBP1B/KLF15/TNNI3K/EDNRA/EDNRB/FBXO32/PPP1R13L/PI16/ARHGAP42/TNNT1/CAV1/MEIS1/KCNA5/STC1/DES/SCN3B/DTNA/DMPK/LMCD1/SLMAP/SLN/KIT/GPD1L/STAC/AGT/PDE9A/PIK3CG/TPM2/CALCA/F2R/GNAO1/CKMT2/RGS2/MYH11/TRIM72/TNNI3/CACNB2/CACNB1/OXTR/SCN1B/PLCE1/SORBS2/MIR17HG/CHRM2/HTR1D/SCNN1B/SNTB1/PPP1R12B/NR4A3/SCN7A/FGF13/HMOX1/MYOC/NOS1/JSRP1/SLC9A1/CNN1/TRPM4/ADRA2A/SLC8A3/CACNA2D1/ATP2B4/CHRNA1/CHRNA3/TACR2/TACR1/UTS2/IGF1/CACNA1H/CACNA1C/CACNA1D/TMOD1/KCNJ3/RCSD1/SSPN/P2RX1/P2RX4/TBX20/ADRA1B/MYOCD/IL1B/COMP/GSTM2/DSC2/GRIP2/PRKG1/LMOD1/LMOD3/TPM1/NEDD4L/CALD1/P2RY1/DRD2/APBB1/GATA4/TNNT2/SSTR2/LTB4R/MYOT/KCND3/SMTN/PDE5A/SCN11A/TIAM1/BDKRB2/HOMER1/CD38/ADRB2/FXYD1/SMPX/SULF1/CAMK2B/HRC/GTF2IRD1/SLC8A1/SCN4A/SCN4B/ACTA2/PLN/PTGER3/ANK2/TNNC1/EZH2/GSN/TIFAB/SLC6A8/CASQ2/HTR7/KCNMA1/FLNA/HAND2/NMUR1/MYL6/MYL9/MYLK/SYNM/CHRNB2/CHRNB4/CHGA/PDE4D/MAP2K6/ABCC8/ABCC9/RYR1/RYR3/EDN3/EDN2/CRYAB
#> GO:0003018                                                                                                                                                                                                                                                                                                  SLC12A2/ATP2A3/ATP1A2/PTGS2/GPR4/ABCG2/SLC6A20/TRPV4/ADRA2C/ANGPT1/HBB/SLC22A3/EDNRA/EDNRB/CYSLTR1/ECE1/ARHGAP42/CAV1/KCNA5/KAT2B/CLDN5/FOXC1/AGT/SLC7A5/TJP3/CALCA/F2R/RGS2/PTP4A3/BMP6/SLC1A2/SLC2A13/OXTR/SLC7A1/PDE3A/FERMT2/HTR1D/SCNN1B/SERPINF2/P2RY2/SLC6A1/SLC4A4/SLC4A8/NOS1/FGA/FGB/ATP1B2/ABCB1/TRPM4/ADRA2A/SLC8A2/SLC28A2/ATP2B4/TACR2/TACR1/UTS2/SLC15A2/KCNMB1/P2RX1/ADRA1B/ADRA1D/COMP/GRIP2/ATP8A1/PRKG1/LEPR/P2RY1/C2CD4A/C2CD4B/DRD5/SLC29A1/ADORA2A/TEK/ZEB2/BDKRB2/F2RL1/CD38/CD36/ADRB1/ADRB2/NPR1/KLF2/SLIT2/SLC2A1/SLC1A5/CEACAM1/PLOD3/KNG1/SLC8A1/PDE2A/ACTA2/VEGFA/AKAP12/SLC38A5/SLC38A3/SLC5A6/SLC6A6/SH3GL2/HTR7/KCNMA1/SLC6A9/SLC1A1/SLC22A5/AGTR1/ABCC8/ABCC9/ABCC1/ABCC2/EDN3/EDN2/CPS1/SLC2A4/SLC13A3
#> GO:0030198                                                                                                                                                                                           COL7A1/ITGA8/COL4A5/COL4A1/DPT/LOX/BMP2/COL14A1/MMP25/MATN3/MATN2/FKBP10/PDPN/RECK/PRSS1/TNFRSF11B/COL10A1/COL28A1/CAV1/COL4A6/SERPINB5/SMOC1/COL11A1/COL11A2/VIT/ZNF469/FOXC1/FOXF1/FOXF2/AGT/COL23A1/GAS2/FLRT2/FSCN1/COL1A2/MYH11/COL27A1/RUNX1/ADAMTSL2/ADAMTSL3/ADAMTSL1/ADAMTSL4/ADAMTSL5/CTSG/POSTN/FGFR4/CTSS/ADAMTS2/ADAMTS8/PTX3/FERMT1/FBLN1/FBLN5/MMP8/MMP9/MMP7/MMP3/SERPINF2/FAP/COL9A2/COL9A3/COL9A1/MMP1/MFAP4/COL22A1/ITGB3/MMP28/COL5A2/COL5A3/COL5A1/COL8A1/KLK7/SOX9/MAD2L2/DDR2/COL1A1/COL17A1/WDR72/TGFBI/SMPD3/COL12A1/COMP/IBSP/HPSE2/ANGPTL7/MMP14/MMP10/MMP11/MMP12/MMP13/PRDX4/PDGFRA/TPSAB1/GPM6B/LAMA1/SERPINH1/LOXL2/LOXL4/ABI3BP/AXIN2/ADAMTS12/ADAMTS14/ADAMTS18/SPINK5/HPN/C6orf15/WT1/ADAM8/EGFL6/SULF1/COL3A1/PLOD3/TLL1/CCDC80/OLFML2B/OLFML2A/NTN4/ADAMTS4/ADAMTS7/ADAMTS6/ADAMTS1/COL24A1/TNXB/CMA1/APLP1
#> GO:0043062                                                                                                                                                                                           COL7A1/ITGA8/COL4A5/COL4A1/DPT/LOX/BMP2/COL14A1/MMP25/MATN3/MATN2/FKBP10/PDPN/RECK/PRSS1/TNFRSF11B/COL10A1/COL28A1/CAV1/COL4A6/SERPINB5/SMOC1/COL11A1/COL11A2/VIT/ZNF469/FOXC1/FOXF1/FOXF2/AGT/COL23A1/GAS2/FLRT2/FSCN1/COL1A2/MYH11/COL27A1/RUNX1/ADAMTSL2/ADAMTSL3/ADAMTSL1/ADAMTSL4/ADAMTSL5/CTSG/POSTN/FGFR4/CTSS/ADAMTS2/ADAMTS8/PTX3/FERMT1/FBLN1/FBLN5/MMP8/MMP9/MMP7/MMP3/SERPINF2/FAP/COL9A2/COL9A3/COL9A1/MMP1/MFAP4/COL22A1/ITGB3/MMP28/COL5A2/COL5A3/COL5A1/COL8A1/KLK7/SOX9/MAD2L2/DDR2/COL1A1/COL17A1/WDR72/TGFBI/SMPD3/COL12A1/COMP/IBSP/HPSE2/ANGPTL7/MMP14/MMP10/MMP11/MMP12/MMP13/PRDX4/PDGFRA/TPSAB1/GPM6B/LAMA1/SERPINH1/LOXL2/LOXL4/ABI3BP/AXIN2/ADAMTS12/ADAMTS14/ADAMTS18/SPINK5/HPN/C6orf15/WT1/ADAM8/EGFL6/SULF1/COL3A1/PLOD3/TLL1/CCDC80/OLFML2B/OLFML2A/NTN4/ADAMTS4/ADAMTS7/ADAMTS6/ADAMTS1/COL24A1/TNXB/CMA1/APLP1
#> GO:0045229                                                                                                                                                                                           COL7A1/ITGA8/COL4A5/COL4A1/DPT/LOX/BMP2/COL14A1/MMP25/MATN3/MATN2/FKBP10/PDPN/RECK/PRSS1/TNFRSF11B/COL10A1/COL28A1/CAV1/COL4A6/SERPINB5/SMOC1/COL11A1/COL11A2/VIT/ZNF469/FOXC1/FOXF1/FOXF2/AGT/COL23A1/GAS2/FLRT2/FSCN1/COL1A2/MYH11/COL27A1/RUNX1/ADAMTSL2/ADAMTSL3/ADAMTSL1/ADAMTSL4/ADAMTSL5/CTSG/POSTN/FGFR4/CTSS/ADAMTS2/ADAMTS8/PTX3/FERMT1/FBLN1/FBLN5/MMP8/MMP9/MMP7/MMP3/SERPINF2/FAP/COL9A2/COL9A3/COL9A1/MMP1/MFAP4/COL22A1/ITGB3/MMP28/COL5A2/COL5A3/COL5A1/COL8A1/KLK7/SOX9/MAD2L2/DDR2/COL1A1/COL17A1/WDR72/TGFBI/SMPD3/COL12A1/COMP/IBSP/HPSE2/ANGPTL7/MMP14/MMP10/MMP11/MMP12/MMP13/PRDX4/PDGFRA/TPSAB1/GPM6B/LAMA1/SERPINH1/LOXL2/LOXL4/ABI3BP/AXIN2/ADAMTS12/ADAMTS14/ADAMTS18/SPINK5/HPN/C6orf15/WT1/ADAM8/EGFL6/SULF1/COL3A1/PLOD3/TLL1/CCDC80/OLFML2B/OLFML2A/NTN4/ADAMTS4/ADAMTS7/ADAMTS6/ADAMTS1/COL24A1/TNXB/CMA1/APLP1
#> GO:0050900                                                                                                                                                                                          SPN/SLC12A2/ITGA2/ITGA7/CHST4/ITGAL/GPR15/RAC3/ZP3/LGALS3/BMP5/ADD2/CCL28/CCL21/CCL20/CCL23/CCL25/CCL24/CCL26/CD200R1/TRPV4/RARRES2/PODXL2/IL6R/TBX21/CNR2/GP1BA/SAA1/EDNRB/IL23A/CX3CR1/PTPRO/F7/MSMP/ROR2/KIT/MCOLN2/PF4/WASL/AIF1/GPR183/PIK3CG/JAM3/CD177/CSF1R/XCL2/CALCA/CXCL11/CXCL10/CXCL13/CXCL12/CXCL17/JAM2/BSG/NCKAP1L/MIF/CTSG/CXCL1/CXCL3/CXCL2/CXCL5/CXCL6/SCG2/VAV1/PGF/SLAMF1/THBS4/SPNS2/TNFSF11/HMOX1/ITGB3/ITGB7/LBP/MMP28/HCK/CNN2/TRPM4/CCR2/CCR7/XCL1/PDGFD/TACR1/CMKLR1/LYVE1/PTN/P2RX4/CCL3/CCL8/PADI2/TRPM2/SMPD3/IL1B/IL10/IL16/IL1A/MAPK3/CCL11/CCL13/CCL14/CCL15/CCL19/THY1/MMP14/P2RY12/IL33/PTK2B/CSF1/NOD2/WNT5A/DPEP1/SERPINE1/IL17A/ASCL2/BST1/BDKRB1/ASB2/F2RL1/CCL5/MADCAM1/PPBP/ADAM8/TNFRSF11A/KLRK1/SLIT2/DUSP1/VEGFA/PTGER4/CXCR5/TREM1/TREM2/NKX2-3/RIPK3/STAP1/GPR18/TNFAIP6/CHGA/KITLG/EDN3/EDN2/SELE
#> GO:0034765                                                                                      ATP2A1/GRP/ATP1A2/HAMP/CLIC5/CLIC2/CLIC3/P2RX7/CLCN2/CLCN1/DMD/SRI/LRRC26/SCN2B/CHD7/KCNK5/GRIN2B/GRIN2D/AKAP5/AKAP6/AKAP9/KCNAB1/HVCN1/FKBP1B/AHNAK/PCSK9/EDNRA/CABP4/TESC/CAV1/KCNA5/KCNA3/BCL2/SCN3A/SCN3B/STIM1/PLP1/SLMAP/SLN/GPD1L/HPCA/STAC/PIK3CG/KCNIP4/KCNIP3/CHP2/KCNG1/KCNG3/F2R/NTSR1/KCNQ4/CACNB2/CACNB1/CXCL11/CXCL10/HCN1/SCN1B/CTSS/TMEM37/GRIN2A/VAMP2/JPH2/JPH3/JPH1/JPH4/NLGN3/MMP9/HECW1/HECW2/SCN7A/FGF13/KCNJ11/KCNJ15/KCNJ14/NOS1/ITGB3/ATP1B2/SCN9A/STOM/ABCB1/JSRP1/SLC9A1/TRPM5/ADRA2A/CCR2/CACNA2D1/ATP2B4/CACNG4/XCL1/TLR9/KCNH4/KCNH1/KCNH8/RELN/NPSR1/CACNA1H/CACNA1A/CACNA1C/CACNA1D/CACNA1E/SLC31A2/KCNJ3/KCNMB1/P2RX1/P2RX4/LRRC55/GSTM2/MS4A1/KCNN4/OSR1/THY1/PIRT/KCNK10/PTK2B/TRPC6/NEDD4L/PDZK1/DRD2/KCNB1/F2/KCND3/SCN11A/BDKRB1/HOMER1/KCNK15/ADRB2/FXYD6/FXYD5/FXYD2/FXYD3/FXYD1/CD19/HRC/SNCA/DPP6/SLC8A1/SCN4A/SCN4B/KCNQ5/PLN/ANK2/ANK3/DPP10/TREM2/EPHB2/FHL1/KCNS3/CASQ2/KCNMA1/FLNA/SLC6A9/PDE4D/EDN3
#> GO:0006935                                                                                              SPN/SLC12A2/ITGA2/RAC3/GAB1/LGALS3/LOX/SEMA6D/NTRK3/BMP4/CCL28/CCL21/CCL20/CCL23/CCL25/CCL24/CCL26/TRPV4/RARRES2/FGFR1/ANGPT2/ANGPT1/IL6R/UNC5C/WNT3/CNR2/SAA1/LGR6/EDNRB/IL23A/CYSLTR1/CX3CR1/PTPRO/F7/SERPIND1/CMTM8/MSMP/CMTM7/KIT/L1CAM/FOSL1/PF4/AIF1/GPR183/PIK3CG/JAM3/CSF1R/XCL2/CALCA/FLRT2/CXCL11/CXCL10/CXCL13/CXCL12/CXCL17/NTN1/BSG/NCKAP1L/MIF/LPAR1/ROBO2/CTSG/CXCL1/CXCL3/CXCL2/CXCL5/CXCL6/SCG2/VAV1/LEF1/PGF/NRG1/SEMA4G/SLAMF1/PLD1/DOCK2/EPHA7/FGF18/SEMA3E/SEMA3D/SEMA3G/THBS4/TNFSF11/LSP1/ITGB3/LBP/MMP28/TRPM4/CCR2/CCR3/CCR7/CCR8/CCR9/EFNA5/XCL1/PDGFD/BIN2/CMKLR1/PTN/P2RX4/CCL3/CCL8/PADI2/TRPM2/IL1B/IL10/IL16/MAPK3/CCL11/CCL13/CCL14/CCL15/CCL19/P2RY12/PDGFRA/PTK2B/CSF1/NOD2/MET/WNT5A/DPEP1/SERPINE1/BST1/TUBB2B/TIAM1/SLIT3/F2RL1/CCL5/PPBP/CCR10/ADAM8/TNFRSF11A/SEMA3B/KLRK1/SLIT2/SEMA6A/FGF7/FGF2/DEFB1/DUSP1/VEGFA/CXCR5/TREM1/TREM2/STAP1/EPHB1/GPR18/TNFAIP6/AGTR1/CHGA/ABCC1/EDN3/EDN2/ENPP2/PLAU
#> GO:0042330                                                                                              SPN/SLC12A2/ITGA2/RAC3/GAB1/LGALS3/LOX/SEMA6D/NTRK3/BMP4/CCL28/CCL21/CCL20/CCL23/CCL25/CCL24/CCL26/TRPV4/RARRES2/FGFR1/ANGPT2/ANGPT1/IL6R/UNC5C/WNT3/CNR2/SAA1/LGR6/EDNRB/IL23A/CYSLTR1/CX3CR1/PTPRO/F7/SERPIND1/CMTM8/MSMP/CMTM7/KIT/L1CAM/FOSL1/PF4/AIF1/GPR183/PIK3CG/JAM3/CSF1R/XCL2/CALCA/FLRT2/CXCL11/CXCL10/CXCL13/CXCL12/CXCL17/NTN1/BSG/NCKAP1L/MIF/LPAR1/ROBO2/CTSG/CXCL1/CXCL3/CXCL2/CXCL5/CXCL6/SCG2/VAV1/LEF1/PGF/NRG1/SEMA4G/SLAMF1/PLD1/DOCK2/EPHA7/FGF18/SEMA3E/SEMA3D/SEMA3G/THBS4/TNFSF11/LSP1/ITGB3/LBP/MMP28/TRPM4/CCR2/CCR3/CCR7/CCR8/CCR9/EFNA5/XCL1/PDGFD/BIN2/CMKLR1/PTN/P2RX4/CCL3/CCL8/PADI2/TRPM2/IL1B/IL10/IL16/MAPK3/CCL11/CCL13/CCL14/CCL15/CCL19/P2RY12/PDGFRA/PTK2B/CSF1/NOD2/MET/WNT5A/DPEP1/SERPINE1/BST1/TUBB2B/TIAM1/SLIT3/F2RL1/CCL5/PPBP/CCR10/ADAM8/TNFRSF11A/SEMA3B/KLRK1/SLIT2/SEMA6A/FGF7/FGF2/DEFB1/DUSP1/VEGFA/CXCR5/TREM1/TREM2/STAP1/EPHB1/GPR18/TNFAIP6/AGTR1/CHGA/ABCC1/EDN3/EDN2/ENPP2/PLAU
#>            Count
#> GO:0006936   143
#> GO:0003012   168
#> GO:0003018   113
#> GO:0030198   125
#> GO:0043062   125
#> GO:0045229   125
#> GO:0050900   140
#> GO:0034765   153
#> GO:0006935   155
#> GO:0042330   155

QUESTION 10a : Quels sont les 3 processus biologiques les plus enrichis?

QUESTION 10b : Ces processus sont-ils cohérents avec les hallmarks du cancer?

QUESTION 10c : Identifiez-vous des processus liés à l’immunité/inflammation?

QUESTION 10d : Quel est le “GeneRatio” et que signifie-t-il?


11.3 GO Enrichment - Séparé Up vs Down

# GO sur gènes UP uniquement
cat("=== GO ENRICHMENT: GÈNES UP-RÉGULÉS ===\n")
#> === GO ENRICHMENT: GÈNES UP-RÉGULÉS ===
ego_bp_up <- enrichGO(
  gene = genes_up_entrez,
  OrgDb = org.Hs.eg.db,
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  readable = TRUE
)

cat("Termes GO-BP (UP):", nrow(ego_bp_up), "\n\n")
#> Termes GO-BP (UP): 666
if(nrow(ego_bp_up) > 0) {
  p_go_up <- dotplot(ego_bp_up, showCategory = 15) +
    ggtitle("GO-BP: Gènes Sur-exprimés (Tumor)")
  
  print(p_go_up)
  ggsave("figures/enrichment/GO_BP_upregulated.pdf", p_go_up, width = 12, height = 10)
}

# GO sur gènes DOWN uniquement
cat("=== GO ENRICHMENT: GÈNES DOWN-RÉGULÉS ===\n")
#> === GO ENRICHMENT: GÈNES DOWN-RÉGULÉS ===
ego_bp_down <- enrichGO(
  gene = genes_down_entrez,
  OrgDb = org.Hs.eg.db,
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  readable = TRUE
)

cat("Termes GO-BP (DOWN):", nrow(ego_bp_down), "\n\n")
#> Termes GO-BP (DOWN): 1116
if(nrow(ego_bp_down) > 0) {
  p_go_down <- dotplot(ego_bp_down, showCategory = 15) +
    ggtitle("GO-BP: Gènes Sous-exprimés (Normal)")
  
  print(p_go_down)
  ggsave("figures/enrichment/GO_BP_downregulated.pdf", p_go_down, width = 12, height = 10)
}

# Combiner
if(nrow(ego_bp_up) > 0 && nrow(ego_bp_down) > 0) {
  p_combined_go <- p_go_up / p_go_down
  ggsave("figures/enrichment/GO_BP_up_vs_down.pdf", p_combined_go, width = 14, height = 16)
}

11.4 KEGG Pathway Enrichment (ORA)

cat("=== KEGG PATHWAY ENRICHMENT ===\n\n")
#> === KEGG PATHWAY ENRICHMENT ===
kegg_enrich <- enrichKEGG(
  gene = genes_entrez,
  organism = "hsa",  # Homo sapiens
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH"
)

cat("Pathways KEGG enrichis:", nrow(kegg_enrich), "\n\n")
#> Pathways KEGG enrichis: 84
if(nrow(kegg_enrich) > 0) {
  # Aperçu
  cat("Top 10 pathways KEGG:\n")
  print(head(as.data.frame(kegg_enrich), 10))
  
  # Dotplot
  p_kegg <- dotplot(kegg_enrich, showCategory = 20) +
    ggtitle("KEGG Pathway Enrichment")
  
  print(p_kegg)
  
  ggsave("figures/enrichment/KEGG_dotplot.pdf", p_kegg, width = 12, height = 10)
  
  # Barplot
  p_kegg_bar <- barplot(kegg_enrich, showCategory = 15) +
    ggtitle("Top 15 KEGG Pathways")
  
  print(p_kegg_bar)
  ggsave("figures/enrichment/KEGG_barplot.pdf", p_kegg_bar, width = 12, height = 10)
  
  # Sauvegarder
  write.csv(as.data.frame(kegg_enrich),
            "results/enrichment/KEGG_enrichment.csv",
            row.names = FALSE)
}
#> Top 10 pathways KEGG:
#>                                      category
#> hsa04060 Environmental Information Processing
#> hsa04020 Environmental Information Processing
#> hsa04820                                 <NA>
#> hsa04061 Environmental Information Processing
#> hsa04974                   Organismal Systems
#> hsa04082                                 <NA>
#> hsa04080 Environmental Information Processing
#> hsa04610                   Organismal Systems
#> hsa04024 Environmental Information Processing
#> hsa04062                   Organismal Systems
#>                                  subcategory       ID
#> hsa04060 Signaling molecules and interaction hsa04060
#> hsa04020                 Signal transduction hsa04020
#> hsa04820                                <NA> hsa04820
#> hsa04061 Signaling molecules and interaction hsa04061
#> hsa04974                    Digestive system hsa04974
#> hsa04082                                <NA> hsa04082
#> hsa04080 Signaling molecules and interaction hsa04080
#> hsa04610                       Immune system hsa04610
#> hsa04024                 Signal transduction hsa04024
#> hsa04062                       Immune system hsa04062
#>                                                            Description
#> hsa04060                        Cytokine-cytokine receptor interaction
#> hsa04020                                     Calcium signaling pathway
#> hsa04820                                  Cytoskeleton in muscle cells
#> hsa04061 Viral protein interaction with cytokine and cytokine receptor
#> hsa04974                              Protein digestion and absorption
#> hsa04082                                  Neuroactive ligand signaling
#> hsa04080                       Neuroactive ligand-receptor interaction
#> hsa04610                           Complement and coagulation cascades
#> hsa04024                                        cAMP signaling pathway
#> hsa04062                                   Chemokine signaling pathway
#>          GeneRatio  BgRatio                   pvalue              p.adjust
#> hsa04060  116/1831 298/9561 0.0000000000000006052336 0.0000000000002094108
#> hsa04020   98/1831 254/9561 0.0000000000002265597081 0.0000000000391948295
#> hsa04820   90/1831 233/9561 0.0000000000020533106846 0.0000000002368151656
#> hsa04061   49/1831 100/9561 0.0000000000126451529765 0.0000000010938057325
#> hsa04974   50/1831 105/9561 0.0000000000297901806092 0.0000000020614804982
#> hsa04082   77/1831 199/9561 0.0000000000751464583088 0.0000000043334457625
#> hsa04080  118/1831 370/9561 0.0000000017180013396440 0.0000000849183519310
#> hsa04610   40/1831  88/9561 0.0000000151007312592004 0.0000006531066269604
#> hsa04024   78/1831 226/9561 0.0000000255496027151809 0.0000009822402821614
#> hsa04062   68/1831 193/9561 0.0000000810568480985331 0.0000028045669442092
#>                         qvalue
#> hsa04060 0.0000000000001471673
#> hsa04020 0.0000000000275448908
#> hsa04820 0.0000000001664262344
#> hsa04061 0.0000000007686921941
#> hsa04974 0.0000000014487435202
#> hsa04082 0.0000000030454090999
#> hsa04080 0.0000000596779412718
#> hsa04610 0.0000004589827527467
#> hsa04024 0.0000006902875119540
#> hsa04062 0.0000019709612537643
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            geneID
#> hsa04060           3587/4838/655/653/652/651/650/8743/3572/56477/6366/6364/6368/6370/6369/10344/3976/3570/8807/4982/9966/51561/11009/3598/85480/1524/1439/10913/55504/5196/3588/1436/6846/654/939/6373/3627/10563/6387/284340/3557/2919/2921/2920/6374/6372/94/9518/2690/51330/355/268/8600/7292/729230/1232/1236/1237/10803/6375/1271/8744/84957/656/7850/6348/6355/115650/3553/3586/3603/4050/356/3552/6356/6357/6358/6359/6363/3589/3953/90865/8200/1435/1437/1440/3590/8794/8795/130399/768211/3605/27189/9173/8808/970/4804/5008/3977/959/6352/5473/2826/53832/8792/8742/3625/83729/8771/643/23495/608/10148/285613/3624/920
#> hsa04020                                                                                                                            814/487/489/8877/5027/4916/5579/5567/2904/2769/2906/2767/2260/84812/2774/1909/1910/57105/10800/5260/108/109/115/6786/6588/255231/810/2149/4923/113026/5333/5021/5137/5153/5136/51196/5582/2264/2263/5255/5737/2903/4485/1129/5332/9965/8817/4842/3706/6547/6543/493/1139/80310/6865/6869/8912/773/775/776/777/3708/5023/5025/147/146/116443/5156/2185/4233/1816/135/623/624/952/153/154/815/816/3270/2254/2252/2247/6546/7422/5350/5733/7134/26281/845/3363/3360/5336/4638/185/2925/6261/6263
#> hsa04820                                                                                                                           8516/3673/3679/1287/1282/477/29767/1756/6442/57644/23500/51332/57731/7138/1288/84823/1674/4703/1837/1301/1302/7169/375790/1158/1634/1278/4629/7137/85301/56776/171024/22989/2199/2192/127294/8736/6641/1298/1299/1297/7060/3690/3695/3696/482/1290/50509/1289/22801/7111/1277/8082/1311/1824/25802/56203/6712/7168/81624/7139/9260/64236/27295/284217/9499/165904/4892/84665/271/270/57159/10529/200539/1281/2318/72/8910/287/288/7134/11155/7058/255631/114793/2273/23345/1465/633/10398/1462
#> hsa04061                                                                                                                                                                                                                                                                                                                                                              3587/8743/3572/56477/6366/6364/6368/6370/6369/10344/3570/8807/11009/1524/5196/3588/1436/6846/6373/3627/10563/6387/2919/2921/2920/6374/6372/729230/1232/1236/1237/10803/6375/6348/6355/3586/6356/6357/6358/6359/6363/1435/8794/8795/6352/5473/2826/53832/643
#> hsa04974                                                                                                                                                                                                                                                                                                                                                    1294/1287/1282/206358/477/7512/8645/7373/4224/4225/5644/1300/340267/1288/11136/1301/1302/91522/6519/1278/85301/4311/6550/1298/1299/1297/169044/482/1290/50509/1289/6547/6543/1295/6564/1277/6520/1308/1303/3783/1358/1359/486/1281/6510/6546/255631/81578/6505/340024
#> hsa04082                                                                                                                                                                                                                       5368/5579/5567/2904/2906/2767/6571/57030/152/2774/2791/55970/51764/111/108/109/115/2568/2564/2563/246213/2788/2786/2785/54331/2775/6506/5582/5179/2903/1132/1129/3352/5332/60482/5029/6529/63910/2557/2892/2893/150/1139/1136/6865/6869/5023/5025/147/146/116443/2898/2899/2901/9934/64805/53829/2770/5028/1816/1813/285242/3359/135/5443/153/154/2918/2914/140/3363/3360/6536/6505/1141/1143/2560
#> hsa04080 2922/5368/5027/10874/4828/2904/2906/4543/286530/152/6755/7425/5644/1268/1269/1909/1910/57105/10800/7433/2908/27346/27334/7068/2568/2564/2563/7432/183/283869/796/2149/4923/7434/5729/5021/1902/1511/5737/5179/2903/3001/2690/66004/1132/1129/3352/4157/9340/5029/6751/2557/2892/2893/150/1138/1139/1134/1136/6865/6869/10911/6750/387129/4887/5023/5025/147/146/116443/3814/2898/2899/2901/8862/4886/9934/53829/3953/5028/1816/1813/6752/1241/2147/130576/135/5443/623/624/90226/114131/2641/2151/2150/153/154/2918/2914/53637/3827/10022/5734/5733/5697/6019/140/3363/3360/10316/185/5745/1141/1143/2560/2925/1908/1907
#> hsa04610                                                                                                                                                                                                                                                                                                                                                                                                                             3687/3075/3426/629/5104/1675/5055/2155/3053/1380/1378/2149/2162/2/5345/2243/2244/733/717/730/729/4179/10544/5054/5270/2147/2153/714/713/712/11326/1604/623/624/2151/5648/2161/1191/3827/5328
#> hsa04024                                                                                                                                                                                                                                 814/487/489/477/5881/5567/2904/2906/5906/6755/51/1909/111/108/109/115/64399/7432/810/2149/7434/7137/5602/5021/51196/5139/3776/7409/2903/1129/3352/5337/268/84699/6751/482/2892/2893/6548/493/2737/6662/6750/775/776/10846/116443/5595/4886/2770/1259/1261/1816/1813/6752/64091/64208/135/5443/7074/2641/11149/153/154/486/5348/4881/815/816/5350/5143/5733/627/3360/10398/5144/1908/1907
#> hsa04062                                                                                                                                                                                                                                                              5881/5579/5567/56477/6366/6364/6368/6370/6369/10344/5906/653361/2791/55970/51764/6655/111/108/109/115/1524/2788/2786/2785/5196/54331/5294/6846/6373/3627/10563/6387/2919/2921/2920/6374/6372/7409/1794/5332/3055/729230/1232/1236/1237/10803/23533/6375/408/6348/6355/53358/5595/6356/6357/6358/6359/6363/2185/2770/7074/6352/5473/2826/7454/10235/643/5336
#>          Count
#> hsa04060   116
#> hsa04020    98
#> hsa04820    90
#> hsa04061    49
#> hsa04974    50
#> hsa04082    77
#> hsa04080   118
#> hsa04610    40
#> hsa04024    78
#> hsa04062    68

QUESTION 11a : Y a-t-il une voie enrichie pour le cancer ?

QUESTION 11b : Quels autres pathways cancéreux sont détectés? Le pathway hsa05210 est-il enrichi?

QUESTION 11c : Identifiez-vous des voies de signalisation majeures
(Wnt, MAPK, PI3K…)?

QUESTION 11d : Comment interprétez-vous la présence de “Pathways in cancer”?


11.5 Gene Set Enrichment Analysis (GSEA)

** GSEA vs ORA**

ORA (Over-Representation Analysis) : - Utilise une liste de gènes significatifs (seuil arbitraire) - Test hypergéométrique - Perd l’information sur les gènes “presque” significatifs

GSEA (Gene Set Enrichment Analysis) : - Utilise TOUS les gènes classés par fold change - Pas de seuil arbitraire - Plus de puissance statistique - Détecte des effets faibles mais coordonnés

cat("=== GENE SET ENRICHMENT ANALYSIS (GSEA) ===\n\n")
#> === GENE SET ENRICHMENT ANALYSIS (GSEA) ===
# Préparer la liste rangée
# Tous les gènes avec leur log2FC
gene_list_full <- res$log2FoldChange
names(gene_list_full) <- rownames(res)

# Enlever NA
gene_list_full <- gene_list_full[!is.na(gene_list_full)]

# Convertir en Entrez IDs
gene_list_entrez <- mapIds(org.Hs.eg.db,
                           keys = names(gene_list_full),
                           column = "ENTREZID",
                           keytype = "SYMBOL",
                           multiVals = "first")

# Associer log2FC aux Entrez IDs
gene_list_gsea <- gene_list_full[!is.na(gene_list_entrez)]
names(gene_list_gsea) <- gene_list_entrez[!is.na(gene_list_entrez)]

# Ordonner par log2FC décroissant
gene_list_gsea <- sort(gene_list_gsea, decreasing = TRUE)

cat("Gènes dans la liste GSEA:", length(gene_list_gsea), "\n")
#> Gènes dans la liste GSEA: 13749
cat("Range log2FC:", round(range(gene_list_gsea), 2), "\n\n")
#> Range log2FC: -8.89 10.38
# GSEA sur KEGG
gsea_kegg <- gseKEGG(
  geneList = gene_list_gsea,
  organism = "hsa",
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  verbose = FALSE
)

cat("Pathways GSEA significatifs:", nrow(gsea_kegg), "\n\n")
#> Pathways GSEA significatifs: 85
if(nrow(gsea_kegg) > 0) {
  # Aperçu
  cat("Top pathways GSEA:\n")
  print(head(as.data.frame(gsea_kegg)[, c("Description", "NES", "pvalue", "p.adjust")], 10))
  
  # Dotplot GSEA
  p_gsea <- dotplot(gsea_kegg, showCategory = 20, split = ".sign") +
    facet_grid(.~.sign) +
    ggtitle("GSEA: KEGG Pathways")
  
  print(p_gsea)
  
  ggsave("figures/enrichment/GSEA_KEGG_dotplot.pdf", p_gsea, width = 14, height = 10)
  
  # Sauvegarder
  write.csv(as.data.frame(gsea_kegg),
            "results/enrichment/GSEA_KEGG.csv",
            row.names = FALSE)
}
#> Top pathways GSEA:
#>                                Description       NES          pvalue
#> hsa04382      Cornified envelope formation  2.415821 0.0000000001000
#> hsa04110                        Cell cycle  2.104312 0.0000000208105
#> hsa03008 Ribosome biogenesis in eukaryotes  2.252782 0.0000001297551
#> hsa04022        cGMP-PKG signaling pathway -1.967218 0.0000002595848
#> hsa04082      Neuroactive ligand signaling -1.921698 0.0000009087986
#> hsa03013       Nucleocytoplasmic transport  2.039804 0.0000012881096
#> hsa04024            cAMP signaling pathway -1.865874 0.0000011502711
#> hsa04725               Cholinergic synapse -1.972770 0.0000026689863
#> hsa00982 Drug metabolism - cytochrome P450 -2.063371 0.0000046358938
#> hsa04517                IgSF CAM signaling -1.728426 0.0000046120435
#>                p.adjust
#> hsa04382 0.000000035100
#> hsa04110 0.000003652242
#> hsa03008 0.000015181342
#> hsa04022 0.000022778566
#> hsa04082 0.000063797662
#> hsa03013 0.000064589498
#> hsa04024 0.000064589498
#> hsa04725 0.000117101773
#> hsa00982 0.000162719873
#> hsa04517 0.000162719873

QUESTION 12a : Les pathways enrichis sont-ils similaires entre ORA et GSEA?

QUESTION 12b : GSEA détecte-t-il des pathways que ORA avait manqués?

QUESTION 12c : Qu’est-ce que le NES (Normalized Enrichment Score)?

QUESTION 12d : Quelle méthode préférez-vous et pourquoi?


11.6 Enrichment Plots pour Pathways Spécifiques

if(nrow(gsea_kegg) > 0) {
  # Prendre les 3 pathways les plus significatifs
  top3_pathways <- head(gsea_kegg$Description, 3)
  
  for(pathway in top3_pathways) {
    p_enrich <- gseaplot2(
      gsea_kegg,
      geneSetID = which(gsea_kegg$Description == pathway),
      title = pathway,
      pvalue_table = TRUE
    )
    
    print(p_enrich)
    
    # Sauvegarder
    filename <- paste0("figures/enrichment/GSEA_plot_", 
                      gsub("[^A-Za-z0-9]", "_", pathway), ".pdf")
    ggsave(filename, p_enrich, width = 12, height = 8)
  }
}


11.7 Reactome Pathway Enrichment

cat("=== REACTOME PATHWAY ENRICHMENT ===\n\n")
#> === REACTOME PATHWAY ENRICHMENT ===
reactome_enrich <- enrichPathway(
  gene = genes_entrez,
  organism = "human",
  pvalueCutoff = 0.05,
  readable = TRUE
)

cat("Pathways Reactome enrichis:", nrow(reactome_enrich), "\n\n")
#> Pathways Reactome enrichis: 69
if(nrow(reactome_enrich) > 0) {
  # Aperçu
  cat("Top 10 pathways Reactome:\n")
  print(head(as.data.frame(reactome_enrich), 10))
  
  # Dotplot
  p_reactome <- dotplot(reactome_enrich, showCategory = 20) +
    ggtitle("Reactome Pathway Enrichment")
  
  print(p_reactome)
  
  ggsave("figures/enrichment/Reactome_dotplot.pdf", p_reactome, width = 12, height = 10)
  
  # Sauvegarder
  write.csv(as.data.frame(reactome_enrich),
            "results/enrichment/Reactome_enrichment.csv",
            row.names = FALSE)
}
#> Top 10 pathways Reactome:
#>                          ID
#> R-HSA-1474244 R-HSA-1474244
#> R-HSA-397014   R-HSA-397014
#> R-HSA-5576891 R-HSA-5576891
#> R-HSA-373076   R-HSA-373076
#> R-HSA-1474228 R-HSA-1474228
#> R-HSA-500792   R-HSA-500792
#> R-HSA-1474290 R-HSA-1474290
#> R-HSA-2022090 R-HSA-2022090
#> R-HSA-418594   R-HSA-418594
#> R-HSA-112316   R-HSA-112316
#>                                                                Description
#> R-HSA-1474244                            Extracellular matrix organization
#> R-HSA-397014                                            Muscle contraction
#> R-HSA-5576891                                           Cardiac conduction
#> R-HSA-373076                          Class A/1 (Rhodopsin-like receptors)
#> R-HSA-1474228                      Degradation of the extracellular matrix
#> R-HSA-500792                                           GPCR ligand binding
#> R-HSA-1474290                                           Collagen formation
#> R-HSA-2022090 Assembly of collagen fibrils and other multimeric structures
#> R-HSA-418594                                 G alpha (i) signalling events
#> R-HSA-112316                                               Neuronal System
#>               GeneRatio   BgRatio                     pvalue
#> R-HSA-1474244  128/2292 300/11009 0.000000000000000003568353
#> R-HSA-397014    83/2292 205/11009 0.000000000082367074139901
#> R-HSA-5576891   60/2292 132/11009 0.000000000151986372085149
#> R-HSA-373076   117/2292 335/11009 0.000000000901040190032301
#> R-HSA-1474228   60/2292 140/11009 0.000000002615641747219934
#> R-HSA-500792   149/2292 468/11009 0.000000007625671314125572
#> R-HSA-1474290   43/2292  90/11009 0.000000009503244728464896
#> R-HSA-2022090   33/2292  61/11009 0.000000010210468574276890
#> R-HSA-418594   109/2292 318/11009 0.000000011137069061151775
#> R-HSA-112316   133/2292 410/11009 0.000000014081124847423994
#>                              p.adjust                  qvalue
#> R-HSA-1474244 0.000000000000005402486 0.000000000000004890522
#> R-HSA-397014  0.000000062351875123905 0.000000056443121331659
#> R-HSA-5576891 0.000000076702455778972 0.000000069433774194689
#> R-HSA-373076  0.000000341043711927226 0.000000308724823005804
#> R-HSA-1474228 0.000000792016321058196 0.000000716961169448496
#> R-HSA-500792  0.000001873502506509310 0.000001695960692119253
#> R-HSA-1474290 0.000001873502506509310 0.000001695960692119253
#> R-HSA-2022090 0.000001873502506509310 0.000001695960692119253
#> R-HSA-418594  0.000001873502506509310 0.000001695960692119253
#> R-HSA-112316  0.000002131882301899993 0.000001929855215931162
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           geneID
#> R-HSA-1474244                                                        COL7A1/ITGA8/ITGA2/ITGA7/COL4A5/COL4A1/ITGAX/ITGAL/NRXN1/DMD/LOX/BMP7/BMP4/BMP2/COL14A1/MMP25/MATN3/PRSS1/ADAM12/COL10A1/COL28A1/TIMP1/COL4A6/CAPN14/CAPN13/CAPN12/PTPRS/LTBP4/COL11A1/COL11A2/LTBP2/JAM3/CD44/COL23A1/TTR/AGRN/DCN/COL1A2/COL27A1/JAM2/BSG/HAPLN1/LAMC2/MFAP2/CTSG/CTSS/ADAMTS2/ADAMTS8/NCAM1/FBLN2/FBLN1/FBLN5/MMP8/MMP9/MMP7/MMP3/SCUBE3/SCUBE1/A2M/COL9A2/COL9A3/COL9A1/MMP1/MFAP5/MFAP4/COL22A1/ITGB3/ITGB7/ITGB8/FGA/FGB/COL5A2/COL5A3/COL5A1/ITGA11/COL8A1/KLK7/DDR2/COL1A1/ACAN/EMILIN3/SPP1/DST/COL17A1/COL12A1/COMP/IBSP/P4HA3/MMP14/MMP10/MMP11/MMP12/MMP13/GDF5/TPSAB1/LAMC3/EFEMP1/PCOLCE2/SERPINE1/LAMA1/SERPINH1/LOXL2/LOXL4/ICAM5/ICAM3/SPARC/MADCAM1/ADAMTS14/ADAMTS18/ADAM8/MUSK/COL3A1/CEACAM1/PLOD3/TLL1/FGF2/NTN4/ADAMTS4/ADAMTS1/COL24A1/COL21A1/CAPN5/CAPN2/CAPN9/TNXB/CMA1/BGN/VCAN
#> R-HSA-397014                                                                                                                                                                                                                                                                                                                                                              ATP2A1/ATP2A3/ATP1A2/CLIC2/TMOD2/DMD/SRI/SCN2B/KCNK9/KCNK5/AKAP9/FKBP1B/TNNT1/KCNA5/KAT2B/DES/SCN3A/SCN3B/NEB/STIM1/DMPK/SLN/KCNIP4/KCNIP3/TPM2/MYH11/TRIM72/TNNI3/CACNB2/CACNB1/MME/SCN1B/SORBS1/KCNK2/KCNK3/GUCY1A2/SCN7A/FGF13/KCNJ11/KCNJ14/NOS1/ATP1B2/SCN9A/SLC8A3/SLC8A2/ATP2B4/CACNG4/CACNA1H/CACNA1C/TMOD1/ITPR1/KCNK10/LMOD1/TPM1/CALD1/GATA4/TNNT2/KCND3/PDE5A/SCN11A/KCNK15/FXYD6/FXYD2/FXYD3/FXYD1/NPR1/CAMK2A/CAMK2B/ACTG2/SLC8A1/SCN4A/SCN4B/ACTA2/PLN/TNNC1/CASQ2/CORIN/MYL6/MYL9/MYLK/ABCC9/RYR1/RYR3
#> R-HSA-5576891                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ATP2A1/ATP2A3/ATP1A2/CLIC2/SRI/SCN2B/KCNK9/KCNK5/AKAP9/FKBP1B/KCNA5/KAT2B/SCN3A/SCN3B/STIM1/DMPK/SLN/KCNIP4/KCNIP3/TNNI3/CACNB2/CACNB1/MME/SCN1B/KCNK2/KCNK3/SCN7A/FGF13/KCNJ11/KCNJ14/NOS1/ATP1B2/SCN9A/SLC8A3/SLC8A2/ATP2B4/CACNG4/CACNA1C/ITPR1/KCNK10/GATA4/KCND3/SCN11A/KCNK15/FXYD6/FXYD2/FXYD3/FXYD1/NPR1/CAMK2A/CAMK2B/SLC8A1/SCN4A/SCN4B/PLN/CASQ2/CORIN/ABCC9/RYR1/RYR3
#> R-HSA-373076                                                                                                                                                                               GRP/PNOC/NMU/NMB/GPR4/CCL28/CCL21/CCL20/CCL23/CCL25/MTNR1A/ADRA2C/SSTR5/CNR1/CNR2/SAA1/EDNRA/EDNRB/CYSLTR2/CYSLTR1/ECE2/ECE1/CX3CR1/GPR143/P2RY10/PF4/GPR55/AGT/GPR183/NPW/XCL2/F2R/NTSR1/CXCL11/CXCL10/CXCL13/CXCL12/PTGDR/OXTR/LPAR1/CXCL1/CXCL3/CXCL2/CXCL5/CXCL6/PTGFR/PENK/CHRM4/CHRM2/HTR1D/MC1R/P2RY2/SSTR1/ADRA2A/CCR2/CCR3/CCR7/CCR8/CCR9/XCL1/TACR2/TACR1/UTS2/SST/NPSR1/NPY2R/CMKLR1/OPN3/ADRA1B/ADRA1D/CCL3/KISS1/APLN/CCL11/CCL13/CCL19/NPY1R/P2RY14/P2RY12/P2RY13/P2RY1/DRD5/DRD2/SSTR2/LTB4R/F2/ADORA2A/GPR37L1/POMC/BDKRB1/BDKRB2/F2RL2/F2RL1/CCL5/PPBP/ADRB1/ADRB2/CCR10/S1PR5/KNG1/INSL5/PTGER4/PTGER3/PYY/CXCR5/RLN2/ADORA3/HTR7/HTR4/GPR17/GPR18/NMUR1/OXGR1/AGTR1/GRPR/EDN3/EDN2
#> R-HSA-1474228                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    COL7A1/COL4A5/COL4A1/COL14A1/MMP25/PRSS1/COL10A1/TIMP1/COL4A6/CAPN14/CAPN13/CAPN12/COL11A1/COL11A2/CD44/COL23A1/DCN/COL1A2/BSG/LAMC2/CTSG/CTSS/ADAMTS8/MMP8/MMP9/MMP7/MMP3/SCUBE3/SCUBE1/A2M/COL9A2/COL9A3/COL9A1/MMP1/COL5A2/COL5A3/COL5A1/COL8A1/KLK7/COL1A1/ACAN/SPP1/COL17A1/COL12A1/MMP14/MMP10/MMP11/MMP12/MMP13/TPSAB1/ADAMTS18/ADAM8/COL3A1/TLL1/ADAMTS4/ADAMTS1/CAPN5/CAPN2/CAPN9/CMA1
#> R-HSA-500792  GRP/PNOC/NMU/NMB/GPR4/CCL28/CCL21/CCL20/CCL23/CCL25/MTNR1A/ADRA2C/SSTR5/WNT9A/TAS2R38/WNT3/WNT2/CNR1/CNR2/SAA1/EDNRA/EDNRB/GNG11/GNG12/GNG13/CYSLTR2/CYSLTR1/SHH/VIPR1/ECE2/ECE1/CX3CR1/GPR143/P2RY10/GNG7/GNG4/GNG3/PF4/GPR55/VIP/GNG2/AGT/GPR183/NPW/XCL2/CALCA/F2R/NTSR1/VIPR2/FZD3/FZD5/CXCL11/CXCL10/CXCL13/CXCL12/PTGDR/OXTR/LPAR1/CXCL1/CXCL3/CXCL2/CXCL5/CXCL6/PTGFR/PENK/CHRM4/CHRM2/HTR1D/MC1R/GLP2R/P2RY2/SSTR1/ADRA2A/CCR2/CCR3/CCR7/CCR8/CCR9/XCL1/TACR2/TACR1/UTS2/SST/NPSR1/NPY2R/CMKLR1/DHH/OPN3/ADRA1B/ADRA1D/FZD10/CCL3/KISS1/APLN/CCL11/CCL13/CCL19/NPY1R/P2RY14/P2RY12/P2RY13/WNT7B/P2RY1/DRD5/DRD2/WNT5A/SSTR2/LTB4R/F2/ADORA2A/GPR37L1/CD55/POMC/BDKRB1/BDKRB2/WNT2B/UCN2/UCN3/GCG/F2RL2/F2RL1/CCL5/WNT11/PPBP/ADRB1/ADRB2/CCR10/GRM8/GRM4/S1PR5/KNG1/INSL5/PTGER4/PTGER3/PYY/CXCR5/RLN2/ADORA3/HTR7/HTR4/GPR17/GPR18/NMUR1/OXGR1/AGTR1/PTH1R/GRPR/EDN3/EDN2
#> R-HSA-1474290                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           COL7A1/COL4A5/COL4A1/LOX/COL14A1/COL10A1/COL28A1/COL4A6/COL11A1/COL11A2/COL23A1/COL1A2/COL27A1/LAMC2/CTSS/ADAMTS2/MMP9/MMP7/MMP3/COL9A2/COL9A3/COL9A1/COL22A1/COL5A2/COL5A3/COL5A1/COL8A1/COL1A1/DST/COL17A1/COL12A1/P4HA3/MMP13/PCOLCE2/SERPINH1/LOXL2/LOXL4/ADAMTS14/COL3A1/PLOD3/TLL1/COL24A1/COL21A1
#> R-HSA-2022090                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         COL7A1/COL4A5/COL4A1/LOX/COL14A1/COL10A1/COL4A6/COL11A1/COL11A2/COL1A2/COL27A1/LAMC2/CTSS/MMP9/MMP7/MMP3/COL9A2/COL9A3/COL9A1/COL5A2/COL5A3/COL5A1/COL8A1/COL1A1/DST/COL17A1/COL12A1/MMP13/LOXL2/LOXL4/COL3A1/TLL1/COL24A1
#> R-HSA-418594                                                                                                                                                                                                                     CAMK4/GPSM2/PNOC/NMU/PRKACB/GNA15/GNA11/CCL28/CCL21/CCL20/CCL23/CCL25/MTNR1A/ADRA2C/SSTR5/GNAL/TAS2R38/CNR1/CNR2/SAA1/GNG11/GNG12/GNG13/ADCY5/ADCY2/ADCY3/ADCY9/CX3CR1/GNG7/GNG4/GNG3/PF4/GPR55/GNG2/AGT/GPR183/NPW/RGS5/RGS9/CXCL11/CXCL10/CXCL13/CXCL12/PDE1C/PDE1B/PDE1A/KPNA2/LPAR1/PRKCG/PRKAR2B/CXCL1/CXCL3/CXCL2/CXCL5/CXCL6/PENK/CHRM4/CHRM2/HTR1D/PLCB4/SSTR1/ADRA2A/CCR2/CCR3/CCR7/CCR8/CCR9/SST/NPY2R/ITPR1/OPN3/GNAZ/APLN/CCL13/CCL19/NPY1R/P2RY14/P2RY12/P2RY13/GNAI1/SSTR2/GPR37L1/NBEA/POMC/BDKRB1/BDKRB2/CCL5/PPBP/CCR10/CAMK2A/CAMK2B/GRM8/GRM4/S1PR5/KNG1/INSL5/PDE4C/PTGER3/PYY/CXCR5/ADORA3/GPR17/GPR18/NMUR1/OXGR1/RGS18/PDE4D/RGS16/RGS13
#> R-HSA-112316          CAMK4/LRRTM1/LRRTM2/NRXN1/CPLX1/NTRK3/PRKCB/GJC1/PRKACB/KCNK9/GRIN2B/GRIN2D/AKAP5/SLC18A2/SLC17A7/KCNAB1/LRRC4B/SYT2/SYT7/PRKAA2/GNAL/EPB41L3/NRXN2/PTPRF/GNG11/GNG12/GNG13/ADCY5/ADCY2/ADCY3/ADCY9/KCNA5/KCNA3/PTPRS/GNG7/GNG4/GNG3/PANX2/BCHE/TUBA1A/GNG2/KCNN3/RPS6KA6/KCNG1/KCNG3/KCNQ4/CACNB2/CACNB1/SLC1A2/HCN1/NAAA/KPNA2/SYN3/PRKCG/PRKAR2B/KCNK2/KCNK3/GRIN2A/VAMP2/NLGN3/NRG1/NLGN1/SLC5A7/TSPAN7/KCNJ11/KCNJ15/KCNJ14/SLC6A1/MAOA/GABRA4/GRIA3/GRIA4/TUBB6/TUBB3/MAPT/CACNA2D1/CACNG4/CHRNA5/CHRNA7/CHRNA1/CHRNA3/KCNH4/KCNH1/KCNH8/CACNA1A/CACNA1E/SNAP25/KCNJ3/KCNMB1/DLG2/GRIN3A/GRIK2/GRIK3/GRIK5/RASGRF1/KCNN4/GRIP2/MAPK3/KCNK10/GNAI1/SYT12/KCNB1/HTR3E/HTR3A/PPFIA4/KIF17/KCND3/NEFL/NBEA/TUBB2B/HOMER1/CAMK2A/CAMK2B/APBA1/STX1A/KCNQ5/GLS2/KCNS3/GAD1/KCNMA1/STXBP1/TUBAL3/NLGN4Y/NLGN4X/SLC1A1/SYN2/CHRNB2/CHRNB4/GABRB1/SLITRK5/SLITRK3/ABCC8/ABCC9
#>               Count
#> R-HSA-1474244   128
#> R-HSA-397014     83
#> R-HSA-5576891    60
#> R-HSA-373076    117
#> R-HSA-1474228    60
#> R-HSA-500792    149
#> R-HSA-1474290    43
#> R-HSA-2022090    33
#> R-HSA-418594    109
#> R-HSA-112316    133


12 Résumé Final et Export

12.1 Tableau Résumé de l’Analyse

# Créer un tableau résumé
summary_stats <- data.frame(
  Statistique = c(
    "Gènes testés",
    "Gènes significatifs (padj < 0.05, |log2FC| > 1)",
    "Sur-exprimés (Tumor > Normal)",
    "Sous-exprimés (Tumor < Normal)",
    "Ratio Up/Down",
    "Variance expliquée PC1 (%)",
    "Variance expliquée PC2 (%)",
    "Termes GO-BP enrichis",
    "Pathways KEGG enrichis (ORA)",
    "Pathways GSEA positifs (NES > 0)",
    "Pathways GSEA négatifs (NES < 0)"
  ),
  Valeur = c(
    nrow(counts_filtered),
    nrow(res_sig),
    nrow(up_regulated),
    nrow(down_regulated),
    round(nrow(up_regulated) / nrow(down_regulated), 2),
    var_explained[1],
    var_explained[2],
    if(exists("ego_bp")) nrow(ego_bp) else 0,
    if(exists("kegg_enrich")) nrow(kegg_enrich) else 0,
    if(exists("gsea_kegg")) sum(gsea_kegg$NES > 0) else 0,
    if(exists("gsea_kegg")) sum(gsea_kegg$NES < 0) else 0
  )
)

cat("\n=== RÉSUMÉ DE L'ANALYSE ===\n\n")
#> 
#> === RÉSUMÉ DE L'ANALYSE ===
print(summary_stats)
#>                                        Statistique   Valeur
#> 1                                     Gènes testés 16023.00
#> 2  Gènes significatifs (padj < 0.05, |log2FC| > 1)  4316.00
#> 3                    Sur-exprimés (Tumor > Normal)  1994.00
#> 4                   Sous-exprimés (Tumor < Normal)  2322.00
#> 5                                    Ratio Up/Down     0.86
#> 6                       Variance expliquée PC1 (%)    20.26
#> 7                       Variance expliquée PC2 (%)    11.06
#> 8                            Termes GO-BP enrichis  1520.00
#> 9                     Pathways KEGG enrichis (ORA)    84.00
#> 10                Pathways GSEA positifs (NES > 0)    25.00
#> 11                Pathways GSEA négatifs (NES < 0)    60.00
# Sauvegarder
write.csv(summary_stats, 
          "results/analysis_summary.csv",
          row.names = FALSE)

cat("\n Résumé sauvegardé: results/analysis_summary.csv\n")
#> 
#>  Résumé sauvegardé: results/analysis_summary.csv

12.2 Top Gènes Candidats pour Validation

cat("\n=== TOP 10 GÈNES CANDIDATS POUR VALIDATION ===\n\n")
#> 
#> === TOP 10 GÈNES CANDIDATS POUR VALIDATION ===
# Critères:
# 1. Très significatif (padj très faible)
# 2. Fold change élevé
# 3. Expression moyenne élevée (pas un gène rare)

res_sig_df_ordered <- res_sig_df[order(res_sig_df$padj), ]

top_candidates <- head(res_sig_df_ordered, 10)

cols_to_show <- intersect(
  c("symbol", "genename", "log2FoldChange", "padj", "baseMean"),
  colnames(top_candidates)
)

print(top_candidates[, cols_to_show])
#>            symbol log2FoldChange
#> CDH3         CDH3       6.096033
#> ETV4         ETV4       5.498884
#> KRT80       KRT80       6.997234
#> FOXQ1       FOXQ1       6.211920
#> KIAA1199 KIAA1199       5.705061
#> UGP2         UGP2      -2.120057
#> CLDN1       CLDN1       4.939233
#> GRIN2D     GRIN2D       4.908921
#> ESM1         ESM1       6.221027
#> BEST4       BEST4      -6.599301
#>                                                                                                                                                                                                                                                                                                         padj
#> CDH3     0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
#> ETV4     0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000009143815
#> KRT80    0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001720139383068
#> FOXQ1    0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000046998172458780113056910077151684390628361143171787261962890625000000000000000
#> KIAA1199 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000018617065247058253654385828745887465629493817687034606933593750000000000000000000000000000000000
#> UGP2     0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002449595215611143963867213368956754493410699069499969482421875000000000000000000000000000000000000000000
#> CLDN1    0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000360593848461482517045624540674708669030223973095417022705078125000000000000000000000000000000000000000000
#> GRIN2D   0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000021715266369409016919671956413750990577682387083768844604492187500000000000000000000000000000000000000000000000000000000
#> ESM1     0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006389265246177807756021438390092725967406295239925384521484375000000000000000000000000000000000000000000000000000000000000
#> BEST4    0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000458646794488251987893107086691202312067616730928421020507812500000000000000000000000000000000000000000000000000000000000000
#>           baseMean
#> CDH3     1835.6656
#> ETV4     2202.9720
#> KRT80     828.4886
#> FOXQ1     974.1229
#> KIAA1199 3130.5660
#> UGP2     2851.5609
#> CLDN1    2048.0136
#> GRIN2D    681.6258
#> ESM1      153.5210
#> BEST4     226.5957

12.3 Session Info (Reproductibilité)

# Sauvegarder l'environnement
sink("results/session_info.txt")
cat("=== INFORMATIONS DE SESSION ===\n\n")
cat("Date d'analyse:", as.character(Sys.time()), "\n\n")
cat("Système:\n")
print(Sys.info())
cat("\n\nPackages et versions:\n")
sessionInfo()
sink()

cat("Session info sauvegardée: results/session_info.txt\n")
#> Session info sauvegardée: results/session_info.txt
# Sauvegarder l'espace de travail complet
save.image("results/TD_RNAseq_COAD_workspace.RData")
cat("Workspace sauvegardé: results/TD_RNAseq_COAD_workspace.RData\n")
#> Workspace sauvegardé: results/TD_RNAseq_COAD_workspace.RData
cat("   (Pour recharger: load('results/TD_RNAseq_COAD_workspace.RData'))\n")
#>    (Pour recharger: load('results/TD_RNAseq_COAD_workspace.RData'))

12.4 Perspectives et Validations

12.4.1 Validations Expérimentales

1. Validation transcriptomique - qRT-PCR sur 10-15 gènes candidats - Cohorte indépendante (n=30-50 par groupe) - Contrôles : GAPDH, ACTB, HPRT1

2. Validation protéomique - Western Blot (top 5 gènes up/down) - Immunohistochimie sur TMA (Tissue MicroArray) - Scoring semi-quantitatif H-score

3. Études fonctionnelles - siRNA/CRISPR knockdown - Tests phénotypiques : prolifération, migration, invasion - Xénogreffes souris (in vivo)


12.4.2 Analyses Complémentaires

Pour aller plus loin :

  1. Single-cell RNA-seq : Hétérogénéité intra-tumorale, sous-populations
  2. Spatial transcriptomics : Organisation spatiale des cellules
  3. Multi-omics : Intégration protéomique, métabolomique, épigénomique
  4. Sous-typage CMS : Classification en 4 sous-types moléculaires
  5. Analyse de survie : Association expression-pronostic
  6. Réseaux de co-expression : WGCNA, identification de modules
  7. Déconvolution : Estimation composition cellulaire (CIBERSORT, xCell)

13 Références

13.1 Publications Principales

  1. Cancer Genome Atlas Network. (2012). Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407), 330-337. DOI: 10.1038/nature11252

  2. Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. DOI: 10.1186/s13059-014-0550-8

  3. Yu, G., Wang, L. G., Han, Y., & He, Q. Y. (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS, 16(5), 284-287. DOI: 10.1089/omi.2011.0118

  4. Subramanian, A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43), 15545-15550. DOI: 10.1073/pnas.0506580102

  5. Hanahan, D., & Weinberg, R. A. (2011). Hallmarks of cancer: the next generation. Cell, 144(5), 646-674.

  6. Guinney, J., et al. (2015). The consensus molecular subtypes of colorectal cancer. Nature Medicine, 21(11), 1350-1356.


14 Annexes

14.1 Commandes R Utiles

# Recharger l'espace de travail
load("results/TD_RNAseq_COAD_workspace.RData")

# Réinitialiser le device graphique
dev.off()

# Augmenter la mémoire (Windows)
memory.limit(size = 16000)

# Vérifier les objets en mémoire
ls()

# Taille des objets
sort(sapply(ls(), function(x) object.size(get(x))), decreasing = TRUE)

# Nettoyer la mémoire
rm(list = ls())
gc()

14.2 Interprétation des Statistiques

14.2.1 P-value vs P-value Ajustée (FDR)

Statistique Signification Seuil
p-value Probabilité d’observer par hasard < 0.05
padj (FDR) Proportion faux positifs attendue < 0.05

Exemple : - 100 gènes avec p < 0.05 → 5 faux positifs attendus - 100 gènes avec FDR < 0.05 → ~5 faux positifs parmi les découvertes

14.2.2 Log2 Fold Change

log2FC FC Interprétation
1 Doublement
2 Quadruplement
3 ×8
-1 0.5× Moitié
-2 0.25× Quart

14.3 Glossaire

COAD : Colon Adenocarcinoma
CRC : ColorRectal Cancer
DEG : Differentially Expressed Gene
FDR : False Discovery Rate
FC : Fold Change
VST : Variance Stabilizing Transformation
PCA : Principal Component Analysis
GO : Gene Ontology
KEGG : Kyoto Encyclopedia of Genes and Genomes
ORA : Over-Representation Analysis
GSEA : Gene Set Enrichment Analysis
NES : Normalized Enrichment Score
TCGA : The Cancer Genome Atlas
MSI : MicroSatellite Instability
CMS : Consensus Molecular Subtype